Abstract



We try to establish a unified information theoretic approach to learning and to ex- 
plore some of its applications. First, we define predictive information as the mutual 
information between the past and the future of a time series, discuss its behav- 
ior as a function of the length of the series, and explain how other quantities of 
interest studied previously in learning theory — as well as in dynamical systems 
and statistical mechanics — emerge from this universally definable concept. We 
then prove that predictive information provides the unique measure for the com- 
plexity of dynamics underlying the time series and show that there are classes 
of models characterized by power-law growth of the predictive information that are 
qualitatively more complex than any of the systems that have been investigated 
before. Further, we investigate numerically the learning of a nonparametric prob- 
ability density, which is an example of a problem with power-law complexity, 
and show that the proper Bayesian formulation of this problem provides for the 
'Occam' factors that punish overly complex models and thus allow one to learn 
not only a solution within a specific model class, but also the class itself using the data
only and with very few a priori assumptions. We study a possible information 
theoretic method that regularizes the learning of an undersampled discrete vari- 
able, and show that learning in such a setup goes through stages of very different 
complexities. Finally, we discuss how all of these ideas may be useful in various 
problems in physics, statistics, and, most importantly, biology.

"All of the books in the world contain no more information than is broadcast as
video in a single large American city in a single year. Not all bits have equal 
value." 

Carl Sagan

"My interest is in the future because I am going to spend the rest of my life 
there." 

Charles F. Kettering

"That is what learning is. You suddenly understand something you've 
understood all your life, but in a new way." 
Doris Lessing 

"Learning is not compulsory. . . Neither is survival." 
W. Edwards Deming 

"Where is the knowledge we've lost in information?" 
T. S. Eliot 

"What most experimenters take for granted before they begin their experiments 
is infinitely more interesting than any results to which their experiments lead." 
Norbert Wiener 



^ All quotations shown on this page can be found at the electronic archive 
http://www.starlrngtech.com/quotes/ 



Chapter 1 



Introduction: what do we know? 



We hope that while reading this work our readers will unsurprisingly realize that 
they actually are learning something. However, what may come as a surprise is 
that they learn a lot more than they think: while reading this very sentence the 
photoreceptors in the eyes estimate the mean intensity of the ambient light and 
adapt to it; the auditory cortex monitors the surroundings and warns if a visi- 
tor knocks on the door. The reader skips the endings of some long, complicated 
words because he has already guessed what is coming; he then notices peculiar- 
ities in the stylistics of the text and soon learns to distinguish sentences written 
late at night. And then, finally, there is the "true" learning of the thoughts that 
the authors try to convey in their writing. 

Learning is everywhere around and inside us, and it is absolutely essential 
for our second-to-second survival. In fact, because of its utmost importance and 
omnipresence each one of us has a well developed personal, unique intuition 
on what "learning" means, and how it works. One might think that such enor- 
mous experience would come in handy when studying learning from a scientific 
perspective, but the situation is quite the opposite: it is extremely difficult to 
build a theory that unites the enormous spectrum of possible learning problems. 



Intuition built up for the case of learning to play a musical instrument may be 
totally useless (and even destructive) for studying, for example, how we learn 
our first language, or master mathematical concepts. A multitude of ideas and
approaches, each treating its own specific problem and only loosely related to the
others, is indeed what we see in learning science now.

In fact, there is no such thing as "the learning theory." There is statistical
learning theory, which builds probabilistic bounds on our ability to estimate the
parameters of models that describe some observations, and its formalism seems
completely disjoint from the designs of psychological and physiological experiments
that study learning in humans and animals. Then there is the Minimum
Description Length paradigm, which states that the shorter the code for a set
of samples, the better the knowledge of the structure inside the samples; it is
not clear how to connect these ideas to the numerous learning curves defined in specific
contexts of neural networks. Then there is the idea that, since the speed or
(conversely) the difficulty of learning is related intuitively to the complexity of
the studied problem, learning and complexity should be studied together; this
opens a Pandora's box of different approaches to complexity (later in this work
we list over a dozen definitions of this quantity!) and does not even come close
to quantifying the learning and complexity of, say, some simple geometric concept.
We could continue this list, but the point is clear. We believe that specific learning
scenarios, however interesting and practical they may be, are not going to bring
any more insight to our current understanding of learning (and, for that matter,
complexity). What we need at this stage is not another example (there are too
many of them to comprehend already) but a unifying, generalizing theory.

What do we expect from such a theory? We want it to be physical in its spirit. 



That is, it must explain and unify all accumulated knowledge of the subject (and 
thus necessarily have an element of a review), but this explanation should bring 
a new level of understanding to the old problems, a level from which all the 
problems appear as different realizations of one general phenomenon. However, 
explaining old data is just half of a good theory. Using new tools we must also
be able to ask and answer meaningful new questions; thus the theory should be
constructive enough to serve as a kernel for development. 

We build our presentation to address all of these questions. In Chapter 2 we
introduce a version of the theory of learning and complexity which is built on
information theory and the notion of predictability. After finishing the construction,
we extensively analyze the literature to show that most prior knowledge
of the subject is subsumed in our more general approach. Then we try to show
that the ideas not only explain the old results but can be used to study new
problems as well. For this, we discuss a broad spectrum of possible applications
to physics, to computer science, and to biology, and then single out two
examples for a detailed analysis. In Chapter 3 we study applications of our ideas
to the learning of nonparametric continuous probability densities, and we show
how complexity penalizing Occam factors work in this case. Then in Chapter 4
we turn to the seemingly easier problem of learning a probability distribution of 
a discrete variable, and we study how regularization based only on information 
theory makes learning possible in the undersampled regime. 

One may argue that the examples we discuss are not enough to claim for certain
that our theory indeed is constructive. We hope to allay these concerns in the
near future by studying other possible applications that we mention throughout
our work. However, we want to stress here explicitly that we believe that the
theory itself is complete, the definitions that we make are sensible and unique,
and the conclusions are general and universal.



Chapter 2 



Predictability, Complexity, and Learning



2.1 Why study predictability? 

There is an obvious interest in having practical algorithms for predicting the future,
and there is a correspondingly large literature on the problem of time series
extrapolation.^ But prediction is both more and less than extrapolation: we might
be able to predict, for example, the chance of rain in the coming week even if
we cannot extrapolate the trajectory of temperature fluctuations. In the spirit of
its thermodynamic origins, information theory (Shannon 1948) characterizes the
potentialities and limitations of all possible prediction algorithms, as well as unifying
the analysis of extrapolation with the more general notion of predictability.
Specifically, we can define a quantity, the predictive information, that measures
how much our observations of the past can tell us about the future. The predictive
information characterizes the world we are observing, and we shall see that this
characterization is close to our intuition about the complexity of the underlying
dynamics.

^ The classic papers are by Kolmogoroff (1939, 1941) and Wiener (1949), who essentially solved
all the extrapolation problems that could be solved by linear methods. Our understanding of predictability
was changed by developments in dynamical systems, which showed that apparently
random (chaotic) time series could arise from simple deterministic rules, and this led to vigorous
exploration of nonlinear extrapolation algorithms (Abarbanel et al. 1993). For a review comparing
different approaches, see the conference proceedings edited by Weigend and Gershenfeld
(1994).

Prediction is one of the fundamental problems in neural computation. Much 
of what we admire in expert human performance is predictive in character — the 
point guard who passes the basketball to a place where his teammate will arrive 
in a split second, the chess master who knows how moves made now will influ- 
ence the end game two hours hence, the investor who buys a stock in anticipation 
that it will grow in the year to come. More generally, we gather sensory informa- 
tion not for its own sake but in the hope that this information will guide our 
actions (including our verbal actions). But acting takes time, and sense data can 
guide us only to the extent that those data inform us about the state of the world 
at the time of our actions, so the only components of the incoming data that have 
a chance of being useful are those that are predictive. Put bluntly, nonpredictive 
information is useless to the organism, and it therefore makes sense to isolate the 
predictive information. It will turn out that most of the information we collect 
over a long period of time is nonpredictive, so that isolating the predictive infor- 
mation must go a long way toward separating out those features of the sensory 
world that are relevant for behavior. 

One of the most important examples of prediction is the phenomenon of gen- 
eralization in learning. Learning is formalized as finding a model that explains 
or describes a set of observations, but again this is useful precisely (and only) be- 
cause we expect this model will continue to be valid: in the language of learning 
theory [see, for example, Vapnik (1998)] an animal can gain selective advantage
not from its performance on the training data but only from its performance at
generalization. Generalizing — and not "overfitting" the training data — is pre- 
cisely the problem of isolating those features of the data that have predictive 
value (see also Bialek and Tishby, in preparation). Further, we know that the 
success of generalization hinges on controlling the complexity of the models that 
we are willing to consider as possibilities. Finally, learning a model to describe 
a data set can be seen as an encoding of those data, as emphasized by Rissanen 
(1989), and the quality of this encoding can be measured using the ideas of in- 
formation theory. Thus the exploration of learning problems should provide us with 
explicit links among the concepts of entropy, predictability, and complexity. 

The notion of complexity arises not only in learning theory, but also in several 
other contexts. Some physical systems exhibit more complex dynamics than oth- 
ers (turbulent vs. laminar flows in fluids), and some systems evolve toward more 
complex states than others (spin glasses vs. ferromagnets). The problem of char- 
acterizing complexity in physical systems has a substantial literature of its own 
[for an overview see Bennett (1990)]. In this context several authors have con- 
sidered complexity measures based on entropy or mutual information, although 
as far as we know no clear connections have been drawn among the measures of 
complexity that arise in learning theory and those that arise in dynamical systems 
and statistical mechanics. 

An essential difficulty in quantifying complexity is to distinguish complexity
from randomness. A truly random string cannot be compressed and hence requires
a long description; it thus is complex in the sense defined by Kolmogorov
(1965; Li and Vitanyi 1993; Vitanyi and Li 2000), yet the physical process that
generates this string may have a very simple description. Both in statistical mechanics
and in learning theory our intuitive notions of complexity correspond to
statements about the complexity of the underlying process, and not directly to
the description length or Kolmogorov complexity.

Our central result is that the predictive information provides a general measure of 
complexity which includes as special cases some relevant concepts from learning 
theory and from dynamical systems. While the work on the complexity of mod- 
els in learning theory rests specifically on the idea that one is trying to infer a 
model from data, the predictive information is a property of the data (or, more 
precisely, of an ensemble of data) itself without reference to a specific class of 
underlying models. If the data are generated by a process in a known class but 
with unknown parameters, then we can calculate the predictive information ex- 
plicitly and show that this information diverges logarithmically with the size of the 
data set we have observed; the coefficient of this divergence counts the number of 
parameters in the model, or more precisely the effective dimension of the model 
class, and this provides a link to known results of Rissanen and others. But our 
approach also allows us to quantify the complexity of processes that fall outside 
the finite dimensional models of conventional learning theory, and we show that 
these more complex processes are characterized by a power-law rather than a logarithmic 
divergence of the predictive information. 

By analogy with the analysis of critical phenomena in statistical physics, the 
separation of logarithmic from power-law divergences, together with the measurement
of coefficients and exponents for these divergences, allows us to define
"universality classes" for the complexity of data streams. The power-law or nonparametric
class of processes may be crucial in real world learning tasks, where
the effective number of parameters becomes so large that asymptotic results for
finitely parameterizable models are inaccessible in practice. There is empirical 
evidence that simple physical systems can generate dynamics in this complexity 
class, and there are hints that language also may fall in this class. 

Finally, we argue that the divergent components of the predictive information pro- 
vide a unique measure of complexity that is consistent with certain simple require- 
ments. This argument is in the spirit of Shannon's original derivation of entropy 
as the unique measure of available information. We believe that this uniqueness 
argument provides a conclusive answer to the question of how one should quan- 
tify the complexity of a process generating a time series. 

With the evident cost of lengthening our discussion, we have tried to give a 
self-contained presentation that develops our point of view, uses simple exam- 
ples to connect with known results, and then generalizes and goes beyond these 
results.^ Even in cases where at least the qualitative form of our results is known
from previous work, we believe that our point of view elucidates some issues 
that may have been less the focus of earlier studies. Last but not least, we explore 
the possibilities for connecting our theoretical discussion with the experimental 
characterization of learning and complexity in neural systems. 

2.2 A curious observation 

Before starting the systematic analysis of the problem, we want to motivate our
discussion further by presenting results of some simple numerical experiments.

^ Some of the basic ideas presented here, together with some connections to earlier work, can
be found in brief preliminary reports (Bialek 1995; Bialek and Tishby 1999). The central results of
the present work, however, were at best conjectures in these preliminary accounts.

Figure 2.1: Calculating entropy of words of length 4 in a chain of 17 spins. For
this chain, $n(W_0) = n(W_1) = n(W_3) = n(W_7) = n(W_{12}) = n(W_{14}) = 2$, $n(W_8) =
n(W_9) = 1$, and all other frequencies are zero. Thus, $S(4) \approx 2.95$ bits.

Suppose we have a 1-dimensional chain of Ising spins with the Hamiltonian
given by

$$ H = -\sum_{i,j} J_{ij} \sigma_i \sigma_j , \tag{2.1} $$

where the matrix $J_{ij}$ is not necessarily tridiagonal (that is, long range interactions
are also allowed). One may identify spins pointing upwards with 1 and
downwards with 0, and then a spin chain is equivalent to some sequence of binary
digits. This sequence consists of (overlapping) words of $N$ digits each, $W_k$,
$k = 0, 1, \ldots, 2^N - 1$. Even though there are $2^N$ such words total, they appear with
very different frequencies $n(W_k)$ in the spin chain [see Fig. (2.1) for details]. If the
number of spins is large, then counting these frequencies provides a good empirical
estimate of $P_N(W_k)$, the probability distribution of different words of length
$N$. Then one can calculate the entropy $S(N)$ of this probability distribution by
the usual formula



$$ S(N) = -\sum_{k=0}^{2^N-1} P_N(W_k) \log_2 P_N(W_k) \quad \text{(bits)}. \tag{2.2} $$
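As a concrete check of Eq. (2.2), the following Python sketch counts overlapping words in a binary string and evaluates the entropy of their empirical distribution. The 17-spin chain used here is a hypothetical example, chosen so that six words occur twice and two occur once, as in the caption of Figure 2.1; it is not necessarily the chain drawn in the figure, and the function name is ours.

```python
from collections import Counter
from math import log2

def block_entropy(chain: str, n: int) -> float:
    """Entropy (bits) of the empirical distribution of overlapping n-digit words."""
    words = [chain[i:i + n] for i in range(len(chain) - n + 1)]
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical 17-spin chain (1 = up, 0 = down); not taken from the figure.
chain = "00001110000111001"
print(round(block_entropy(chain, 4), 2))  # 2.95
```

A chain of 17 spins contains only 14 overlapping words of length 4, so this is an illustration of the bookkeeping rather than a reliable estimate; the text relies on very long chains for that reason.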



Since entropy is an extensive property, $S(N)$ is asymptotically proportional to
$N$ for any spin chain. Choosing a different set of couplings $J_{ij}$ may change the
coefficient of proportionality (and finding this coefficient is usually the goal of
statistical mechanics) but the linearity is never challenged.

Figure 2.2: Entropy as a function of the word length for spin chains with different
interactions. Notice that all lines start from $S(N) = \log_2 2 = 1$ since at the values
of the coupling we investigated the correlation length is much smaller than the
chain length ($1 \cdot 10^9$ spins).

We investigated this in three different spin chains of one billion spins each (the
temperature is always $k_B T = 1$). For the first chain, only $J_{i,i+1} = 1$ was nonzero,
and its value was the same for all $i$. The second chain was also generated using
the nearest neighbor interactions, but the value of the coupling was reinitialized
every 400,000 spins by taking a random number from a Gaussian distribution
with a zero mean and a unit variance. In the third case, we again reinitialized
at the same frequency, but now interactions were long-ranged, and the variance
of coupling constants decreased with the distance between the spins as $\langle J_{ij}^2 \rangle =
1/(i - j)^2$. We plotted $S(N)$ for all these cases in Fig. (2.2), and, of course, the
asymptotically linear behavior seems to be evident: the extensive entropy shows
no qualitative distinction between the three cases we consider.

Figure 2.3: Subextensive part of the entropy as a function of the word length.

However, the situation changes drastically if we remove the asymptotic linear
contribution and plot only the sublinear component $S_1(N)$ of the entropy. As we
see in Fig. (2.3), the three investigated chains then exhibit qualitatively different
features: for the first one, $S_1$ is constant; for the second one, it is logarithmic; and,
for the third one, it clearly shows a power-law behavior.
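The constant $S_1$ of the first chain can be checked without any simulation. The sketch below assumes an open chain in zero field, so that the fixed nearest-neighbor model is a two-state Markov chain in which neighboring spins agree with probability $p = e^K/(e^K + e^{-K})$, where $K = J/k_B T$ (up to the convention for counting pairs in Eq. (2.1)). The exhaustive enumeration and the function names are ours and are unrelated to the billion-spin simulations described above; the value of $K$ only sets the size of the constant.

```python
from itertools import product
from math import exp, log2

def word_probability(word, p_same):
    """Probability of a spin word in a stationary two-state Markov chain."""
    prob = 0.5  # zero field: both orientations of the first spin are equally likely
    for a, b in zip(word, word[1:]):
        prob *= p_same if a == b else 1.0 - p_same
    return prob

def block_entropy(n, p_same):
    """Exact S(N) in bits, summing over all 2**n words."""
    probs = (word_probability(w, p_same) for w in product((0, 1), repeat=n))
    return -sum(q * log2(q) for q in probs)

K = 1.0  # assumed J / (k_B T)
p_same = exp(K) / (exp(K) + exp(-K))
S0 = block_entropy(11, p_same) - block_entropy(10, p_same)  # entropy per spin
S1 = [block_entropy(n, p_same) - S0 * n for n in range(1, 11)]
print([round(s, 6) for s in S1])  # the same value for every word length
```

For a Markov chain the block entropy is exactly linear in $N$ beyond the first spin, which is why the subextensive part does not depend on the word length.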

What is the significance of this observation? Of course, the differences must
be related to the ways we chose the $J$'s for the simulations. In the first case, $J$ is
fixed, and there is not much one can learn from observing the spin chain. For
the second chain, $J$ changes, and the statistics of the spin-words are different in
different parts of the sequence. By looking at these statistics, one can thus estimate
the coupling at the current position. Finally, in the third case there are many
coupling constants that can be learned. In principle, as $N$ increases one becomes
sensitive to correlations caused by interactions over larger and larger distances,
and, since the variance of the couplings decays with the distance, interactions of
longer range do not interfere with learning short-scale properties. So, intuitively,
the qualitatively different behavior of $S_1(N)$ for the three plotted cases is due to
a different character of learning tasks involved in understanding the spin chains.
Much of this Chapter can be seen as expanding on and quantifying this intuitive
observation.

2.3 Fundamentals 

The problem of prediction comes in various forms, as noted above. Information 
theory allows us to treat the different notions of prediction on the same footing. 
The first step is to recognize that all predictions are probabilistic — even if we can 
predict the temperature at noon tomorrow, we should provide error bars or confi- 
dence limits on our prediction. The next step is to remember that, even before we 
look at the data, we know that certain futures are more likely than others, and we 
can summarize this knowledge by a prior probability distribution for the future. 
Our observations on the past lead us to a new, more tightly concentrated distri- 
bution, the distribution of futures conditional on the past data. Different kinds 
of predictions are different slices through or averages over this conditional distri- 
bution, but information theory quantifies the "concentration" of the distribution 
without making any commitment as to which averages will be most interesting. 

^ Note again that we are dealing here with subextensive properties of systems. These are the
properties that are ignored in most problems in statistical mechanics.

Imagine that we observe a stream of data $x(t)$ over a time interval $-T < t < 0$;
let all of these past data be denoted by the shorthand $x_{\rm past}$. We are interested
in saying something about the future, so we want to know about the data $x(t)$
that will be observed in the time interval $0 < t < T'$; let these future data be
called $x_{\rm future}$. In the absence of any other knowledge, futures are drawn from the
probability distribution $P(x_{\rm future})$, while observations of particular past data $x_{\rm past}$
tell us that futures will be drawn from the conditional distribution $P(x_{\rm future} | x_{\rm past})$.
The greater concentration of the conditional distribution can be quantified by the
fact that it has smaller entropy than the prior distribution, and this reduction in
entropy is Shannon's definition of the information that the past provides about
the future. We can write the average of this predictive information as

$$ I_{\rm pred}(T, T') = \left\langle \log_2 \left[ \frac{P(x_{\rm future} | x_{\rm past})}{P(x_{\rm future})} \right] \right\rangle \tag{2.3} $$

$$ = -\langle \log_2 P(x_{\rm future}) \rangle - \langle \log_2 P(x_{\rm past}) \rangle - \left[ -\langle \log_2 P(x_{\rm future}, x_{\rm past}) \rangle \right] , \tag{2.4} $$

where $\langle \cdots \rangle$ denotes an average over the joint distribution of the past and the
future, $P(x_{\rm future}, x_{\rm past})$.
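The equality of the two forms, Eqs. (2.3) and (2.4), can be verified numerically on any small joint distribution. The following sketch uses a made-up joint distribution over one past symbol and one future symbol; the table and helper names are illustrative assumptions, not part of the original analysis.

```python
from math import log2

# Hypothetical joint distribution P(x_past, x_future) over two binary symbols.
joint = {("a", "a"): 0.4, ("a", "b"): 0.1, ("b", "a"): 0.1, ("b", "b"): 0.4}

def marginal(joint, axis):
    """Sum the joint distribution over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

p_past, p_future = marginal(joint, 0), marginal(joint, 1)

# Eq. (2.3): average log-ratio of conditional to prior probability of the future.
i_ratio = sum(
    p * log2((p / p_past[x]) / p_future[y]) for (x, y), p in joint.items() if p > 0
)
# Eq. (2.4): the same quantity written as a combination of three entropies.
i_entropies = entropy(p_future) + entropy(p_past) - entropy(joint)
print(abs(i_ratio - i_entropies) < 1e-12)  # True
```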

Each of the terms in Eq. (2.4) is an entropy. Since we are interested in predictability
or generalization, which are associated with some features of the signal
persisting forever, we may assume stationarity or invariance under time translations.
Then the entropy of the past data depends only on the duration of our
observations, so we can write $-\langle \log_2 P(x_{\rm past}) \rangle = S(T)$, and by the same argument
$-\langle \log_2 P(x_{\rm future}) \rangle = S(T')$. Finally, the entropy of the past and the future taken
together is the entropy of observations on a window of duration $T + T'$, so that
$-\langle \log_2 P(x_{\rm future}, x_{\rm past}) \rangle = S(T + T')$. Putting these equations together, we obtain

$$ I_{\rm pred}(T, T') = S(T) + S(T') - S(T + T') . \tag{2.5} $$

In the same way that the entropy of a gas at fixed density is proportional to
the volume, the entropy of a time series (asymptotically) is proportional to its
duration, so that $\lim_{T\to\infty} S(T)/T = S_0$; entropy is an extensive quantity. But from
Eq. (2.5) any extensive component of the entropy cancels in the computation of
the predictive information: predictability is a deviation from extensivity. If we write
$S(T) = S_0 T + S_1(T)$, then Eq. (2.5) tells us that the predictive information is
related only to the nonextensive term $S_1(T)$.
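Written out, the cancellation is a one-line substitution of $S(T) = S_0 T + S_1(T)$ into Eq. (2.5):

```latex
\begin{aligned}
I_{\rm pred}(T, T')
  &= \left[ S_0 T + S_1(T) \right] + \left[ S_0 T' + S_1(T') \right]
     - \left[ S_0 (T + T') + S_1(T + T') \right] \\
  &= S_1(T) + S_1(T') - S_1(T + T') .
\end{aligned}
```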

We know two general facts about the behavior of $S_1(T)$. First, the corrections
to extensive behavior are positive, $S_1(T) \geq 0$. Second, the statement that entropy
is extensive is the statement that the limit

$$ \lim_{T\to\infty} \frac{S(T)}{T} = S_0 \tag{2.6} $$

exists, and for this to be true we must also have

$$ \lim_{T\to\infty} \frac{S_1(T)}{T} = 0 . \tag{2.7} $$

Thus the nonextensive terms in the entropy must be subextensive, that is they
must grow with $T$ less rapidly than a linear function. Taken together, these facts
guarantee that the predictive information is positive and subextensive. Further,
if we let the future extend forward for a very long time, $T' \to \infty$, then we can
measure the information that our sample provides about the entire future,

$$ I_{\rm pred}(T) = \lim_{T'\to\infty} I_{\rm pred}(T, T') = S_1(T) . \tag{2.8} $$
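The limit in Eq. (2.8) can be illustrated numerically. The sketch below evaluates Eq. (2.5) for two hypothetical subextensive terms, one logarithmic and one a power law with exponent 1/2, and prints how far $I_{\rm pred}(T, T')$ is from $S_1(T)$ as $T'$ grows. The functional forms, the value of $S_0$, and all names are assumptions made only for this illustration.

```python
from math import log2, sqrt

S0 = 1.0  # extensive entropy rate in bits per unit time (arbitrary choice)

def predictive_information(s1, t, t_future):
    """Eq. (2.5) with S(T) = S0*T + S1(T); the S0 terms cancel."""
    entropy = lambda x: S0 * x + s1(x)
    return entropy(t) + entropy(t_future) - entropy(t + t_future)

# Two hypothetical subextensive terms: logarithmic and power-law (exponent 1/2).
examples = {"logarithmic": lambda t: log2(t), "power law": lambda t: sqrt(t)}

T = 100.0
for name, s1 in examples.items():
    gaps = [s1(T) - predictive_information(s1, T, tf) for tf in (1e2, 1e4, 1e6, 1e8)]
    print(name, [f"{g:.2e}" for g in gaps])  # the gap to S1(T) shrinks as T' grows
```

In both cases the gap equals $S_1(T + T') - S_1(T')$, which vanishes for large $T'$ because these $S_1$ grow more slowly than linearly.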

If we have been observing a time series for a (long) time $T$, then the total
amount of data we have taken in is measured by the entropy $S(T)$, and at large
$T$ this is given approximately by $S_0 T$. But the predictive information that we
have gathered cannot grow linearly with time, even if we are making predictions
about a future which stretches out to infinity. As a result, of the total information
we have taken in by observing $x_{\rm past}$, only a vanishing fraction is of relevance to
the prediction:

$$ \lim_{T\to\infty} \frac{\text{Predictive Information}}{\text{Total Information}} = \lim_{T\to\infty} \frac{I_{\rm pred}(T)}{S(T)} = 0 . \tag{2.9} $$

In this precise sense, most of what we observe is irrelevant to the problem of
predicting the future. We can think of Eq. (2.9) as a law of diminishing returns:
although we collect data in proportion to our observation time $T$, a smaller and
smaller fraction of this information is useful in the problem of prediction. Note
that these diminishing returns are not due to a limited lifetime, since we calculate
the predictive information assuming that we have a future extending forward to
infinity.

Now consider the case where time is measured in discrete steps, so that we
have seen $N$ time points $x_1, x_2, \ldots, x_N$. How much have we learned about the underlying
pattern in these data? The more we know, the more effectively we can
predict the next data point $x_{N+1}$ and hence the fewer bits we will need to describe
the deviation of this data point from our prediction: our accumulated knowledge
about the time series is measured by the degree to which we can compress the
description of new observations. On average, the length of the code word required
to describe the point $x_{N+1}$, given that we have seen the previous $N$ points,
is given by

$$ \ell(N) = -\langle \log_2 P(x_{N+1} | x_1, x_2, \ldots, x_N) \rangle \ \text{bits}, \tag{2.10} $$

where the expectation value is taken over the joint distribution of all the $N + 1$
points, $P(x_1, x_2, \ldots, x_N, x_{N+1})$. It is easy to see that

$$ \ell(N) = S(N + 1) - S(N) \approx \frac{\partial S(N)}{\partial N} . \tag{2.11} $$

As we observe for longer times, we learn more and this word length decreases. It
is natural to define a learning curve that measures this improvement. Usually we
define learning curves by measuring the frequency or costs of errors; here the cost
is that our encoding of the point $x_{N+1}$ is longer than it could be if we had perfect
knowledge. This ideal encoding has a length which we can find by imagining
that we observe the time series for an infinitely long time, $\ell_{\rm ideal} = \lim_{N\to\infty} \ell(N)$,
but this is just another way of defining the extensive component of the entropy
$S_0$. Thus we can define a learning curve

$$ \Lambda(N) = \ell(N) - \ell_{\rm ideal} \tag{2.12} $$

$$ = S(N + 1) - S(N) - S_0 = S_1(N + 1) - S_1(N) \approx \frac{\partial S_1(N)}{\partial N} = \frac{\partial I_{\rm pred}(N)}{\partial N} , \tag{2.13} $$

and we see once again that the extensive component of the entropy cancels.
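A small example makes Eqs. (2.10) through (2.13) concrete. The sketch below assumes a toy process chosen by us: a coin whose bias is drawn once from a uniform prior and then flipped $N$ times. For this process the block entropy has a closed form, the ideal code length is the binary entropy averaged over the prior, $S_0 = 1/(2 \ln 2)$ bits, and the learning curve can be computed exactly. The model and the function names are assumptions for illustration only.

```python
from math import comb, log, log2

def block_entropy(n: int) -> float:
    """S(N) in bits for N flips of a coin whose bias is drawn uniformly from [0, 1]."""
    # Any particular sequence with k ones has probability 1 / ((n + 1) * C(n, k));
    # summing over the C(n, k) such sequences gives each k a total weight 1 / (n + 1).
    return sum(log2((n + 1) * comb(n, k)) for k in range(n + 1)) / (n + 1)

def code_length(n: int) -> float:
    """Eq. (2.11): average bits needed for the next flip after seeing n flips."""
    return block_entropy(n + 1) - block_entropy(n)

S0 = 1.0 / (2.0 * log(2.0))  # binary entropy averaged over a uniform bias, in bits

def learning_curve(n: int) -> float:
    """Eq. (2.12): excess code length over the ideal value S0."""
    return code_length(n) - S0

for n in (1, 10, 100):
    print(n, round(code_length(n), 4), round(learning_curve(n), 4))
```

The code length starts at one bit and falls toward $S_0$; for this one-parameter model the excess decays roughly as $1/(2 N \ln 2)$ at large $N$.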

It is well known that the problems of prediction and compression are related, 
and what we have done here is to illustrate one aspect of this connection. Specif- 
ically, if we ask how much one segment of a time series can tell us about the 
future, the answer is contained in the subextensive behavior of the entropy. If we 
ask how much we are learning about the structure of the time series, then the nat- 
ural and universally defined learning curve is related again to the subextensive 
entropy: the learning curve is the derivative of the predictive information. 

This universal learning curve is connected to the more conventional learning
curves in specific contexts. As an example (cf. Section 2.4.1), consider fitting a set
of data points $\{x_n, y_n\}$ with some class of functions $y = f(x; \alpha)$, where the $\alpha$ are
unknown parameters that need to be learned; we also allow for some Gaussian
noise in our observation of the $y_n$. Here the natural learning curve is the evolution
of $\chi^2$ for generalization as a function of the number of examples. Within the
approximations discussed below, it is straightforward to show that as $N$ becomes
large,

$$ \langle \chi^2(N) \rangle = \frac{1}{\sigma^2} \left\langle [y - f(x; \alpha)]^2 \right\rangle \to (2 \ln 2) \Lambda(N) + 1 , \tag{2.14} $$

where $\sigma^2$ is the variance of the noise. Thus a more conventional measure of performance
at learning a function is equal to the universal learning curve defined
purely by information theoretic criteria. In other words, if a learning curve is 
measured in the right units, then its integral represents the amount of the useful 
information accumulated. Since one would expect any learning curve to decrease 
to zero eventually, we again obtain the 'law of diminishing returns'. 
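Eq. (2.14) can be probed with a small simulation. If, as an assumption not derived in this section, the learning curve of a $K$-parameter model falls off as $\Lambda(N) \approx K/(2 N \ln 2)$, then Eq. (2.14) predicts $\langle \chi^2 \rangle \approx 1 + K/N$ at large $N$. The sketch below tests this for the simplest case we could think of, a one-parameter fit $f(x; \alpha) = \alpha x$ with Gaussian inputs and noise; the model, the parameter values, and the function name are all hypothetical choices.

```python
import random

def average_chi2(n_train, n_trials, sigma=0.5, slope=2.0, seed=0):
    """Mean of [y - f(x; a_hat)]**2 / sigma**2 on a fresh point, for f(x; a) = a * x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
        ys = [slope * x + rng.gauss(0.0, sigma) for x in xs]
        # Least-squares estimate of the single parameter (K = 1).
        a_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        x_new = rng.gauss(0.0, 1.0)
        y_new = slope * x_new + rng.gauss(0.0, sigma)
        total += (y_new - a_hat * x_new) ** 2 / sigma ** 2
    return total / n_trials

for n in (10, 20, 40):
    print(n, round(average_chi2(n, 10000), 3), "vs 1 + 1/N =", round(1 + 1 / n, 3))
```

The simulated values carry sampling noise of order one percent at this number of trials, and the $1 + K/N$ form is only the leading large-$N$ behavior, so agreement should be expected to improve as $N$ grows.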

Different quantities related to the subextensive entropy have been discussed
in several contexts. For example, the code length $\ell(N)$ has been defined as a
learning curve in the specific case of neural networks (Opper and Haussler 1995)
and has been termed the "thermodynamic dive" (Crutchfield and Shalizi 1998)
and "$N$th order block entropy" (Grassberger 1986). Mutual information between
all of the past and all of the future (both semi-infinite) is known also as the "excess
entropy," "effective measure complexity," "stored information," and so on
[see Shalizi and Crutchfield (1999) and references therein, as well as the discussion
below]. If the data allow a description by a model with a finite number of
parameters, then mutual information between the data and the parameters is of
interest, and this is also the predictive information about all of the future; some
special cases of this problem have been discussed by Opper and Haussler (1995)
and by Herschkowitz and Nadal (1999). What is important is that the predictive
information or subextensive entropy is related to all these quantities, and that it can be
defined for any process without a reference to a class of models. It is this universality
that we find appealing, and this universality is strongest if we focus on the limit
of long observation times. Qualitatively, in this regime ($T \to \infty$) we expect the
predictive information to behave in one of three different ways: it may either stay
finite, or grow to infinity together with $T$; in the latter case the rate of growth may
be slow (logarithmic) or fast (sublinear power).

The first possibility, $\lim_{T \to \infty} I_{\rm pred}(T) = {\rm constant}$, means that no matter how
long we observe we gain only a finite amount of information about the future.
This situation prevails, for example, when the dynamics are too regular: for a
purely periodic system, complete prediction is possible once we know the phase,
and if we sample the data at discrete times this is a finite amount of information;
longer period orbits intuitively are more complex and also have larger $I_{\rm pred}$, but
this doesn't change the limiting behavior $\lim_{T \to \infty} I_{\rm pred}(T) = {\rm constant}$.

Alternatively, the predictive information can be small when the dynamics
are irregular but the best predictions are controlled only by the immediate past,
so that the correlation times of the observable data are finite [see, for example,
Crutchfield and Feldman (1997) and the fixed short-range interactions plot on
Fig. (2.3)]. Imagine, for example, that we observe $x(t)$ at a series of discrete times
$\{t_n\}$, and that at each time point we find the value $x_n$. Then we can always write
the joint distribution of the $N$ data points as a product,

$$P(x_1, x_2, \cdots, x_N) = P(x_1)\, P(x_2 | x_1)\, P(x_3 | x_2, x_1) \cdots . \tag{2.15}$$

For Markov processes, what we observe at $t_n$ depends only on events at the
previous time step $t_{n-1}$, so that

$$P(x_n | \{x_{1 \le i \le n-1}\}) = P(x_n | x_{n-1}), \tag{2.16}$$

and hence the predictive information reduces to

$$I_{\rm pred} = \left\langle \log_2 \left[ \frac{P(x_n | x_{n-1})}{P(x_n)} \right] \right\rangle . \tag{2.17}$$

The maximum possible predictive information in this case is the entropy of the
distribution of states at one time step, which in turn is bounded by the logarithm
of the number of accessible states. To approach this bound the system must
maintain memory for a long time, since the predictive information is reduced by the
entropy of the transition probabilities. Thus systems with more states and longer
memories have larger values of $I_{\rm pred}$.
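
As a minimal numerical sketch of Eq. (2.17), the following evaluates the predictive information of a two-state Markov chain. The two-state chain, the flip probabilities `p01` and `p10`, and the function names are illustrative assumptions and are not taken from the text; the average is taken over the stationary joint distribution of two successive states.

```python
import math

def stationary_two_state(p01, p10):
    """Stationary distribution of a two-state chain with flip probabilities
    p01 (state 0 -> 1) and p10 (state 1 -> 0); requires p01 + p10 > 0."""
    total = p01 + p10
    return [p10 / total, p01 / total]

def markov_predictive_information(p01, p10):
    """Average of log2[P(x_n | x_{n-1}) / P(x_n)] over the stationary
    joint distribution of (x_{n-1}, x_n), in bits."""
    pi = stationary_two_state(p01, p10)
    transition = [[1.0 - p01, p01], [p10, 1.0 - p10]]
    info = 0.0
    for prev in (0, 1):
        for nxt in (0, 1):
            joint = pi[prev] * transition[prev][nxt]
            if joint > 0.0:
                info += joint * math.log2(transition[prev][nxt] / pi[nxt])
    return info

# A memoryless chain carries no predictive information.
print(markov_predictive_information(0.5, 0.5))
# A "sticky" chain with longer memory moves toward the one-bit bound log2(2).
print(markov_predictive_information(0.1, 0.1))
print(markov_predictive_information(0.01, 0.01))
```

In this sketch the result equals the entropy of the stationary distribution minus the entropy of the transition probabilities, so it stays below one bit for two states and increases as the chain retains its state for longer.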

More interesting are those cases in which $I_{\rm pred}(T)$ diverges at large $T$. In physical
systems we know that there are critical points where correlation times become
infinite, so that optimal predictions will be influenced by events in the arbitrarily
distant past. Under these conditions the predictive information can grow without
bound as $T$ becomes large; for many systems the divergence is logarithmic,
$I_{\rm pred}(T \to \infty) \propto \ln T$, as for the variable $J_{ij}$, short range Ising model of
Figs. (2.2, 2.3). Long range correlations also are important in a time series where we can learn
some underlying rules. It will turn out that when the set of possible rules can be
described by a finite number of parameters, the predictive information again
diverges logarithmically, and the coefficient of this divergence counts the number
of parameters. Finally, a faster growth is also possible, so that $I_{\rm pred}(T \to \infty) \propto T^{\alpha}$,
as for the variable $J_{ij}$ long range Ising model, and we shall see that this behavior
emerges from, for example, nonparametric learning problems.

2.4 Learning and predictability 

Learning is of interest precisely in those situations where correlations or associ- 
ations persist over long periods of time. In the usual theoretical models, there 
is some rule underlying the observable data, and this rule is valid forever; ex- 
amples seen at one time inform us about the rule, and this information can be 
used to make predictions or generalizations. The predictive information quanti- 
fies the average generalization power of examples, and we shall see that there is 
a direct connection between the predictive information and the complexity of the 
possible underlying rules. 

2.4.1 A test case 

Let us begin with a simple example already mentioned above. We observe two
streams of data $x$ and $y$, or equivalently a stream of pairs $(x_1, y_1), (x_2, y_2), \cdots,
(x_N, y_N)$. Assume that we know in advance that the $x$'s are drawn independently
and at random from some distribution $P(x)$, while the $y$'s are noisy versions of
some function acting on $x$,

$$y_n = f(x_n; \alpha) + \eta_n, \tag{2.18}$$

where $f(x; \alpha)$ is a class of functions parameterized by $\alpha$, and $\eta_n$ is some noise
which for simplicity we will assume is Gaussian with some known standard
deviation $\sigma$. We can even start with a very simple case, where the function class is
just a linear combination of some basis functions, so that

$$f(x; \alpha) = \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x). \tag{2.19}$$

The usual problem is to estimate, from $N$ pairs $\{x_i, y_i\}$, the values of the
parameters $\alpha$; in favorable cases such as this we might even be able to find an effective
regression formula. We are interested in evaluating the predictive information,
which means that we need to know the entropy $S(N)$. We go through the
calculation in some detail because it provides a model for the more general case.

To evaluate the entropy $S(N)$ we first construct the probability distribution
$P(x_1, y_1, x_2, y_2, \cdots, x_N, y_N)$. The same set of rules applies to the whole data stream,
which here means that the same parameters $\alpha$ apply for all pairs $\{x_i, y_i\}$, but
these parameters are chosen at random from a distribution $\mathcal{P}(\alpha)$ at the start of
the stream. Thus we write

$$P(x_1, y_1, x_2, y_2, \cdots, x_N, y_N) = \int d^K\alpha\, P(x_1, y_1, x_2, y_2, \cdots, x_N, y_N | \alpha)\, \mathcal{P}(\alpha), \tag{2.20}$$

and now we need to construct the conditional distributions for fixed $\alpha$. By
hypothesis each $x$ is chosen independently, and once we fix $\alpha$ each $y_i$ is correlated
only with the corresponding $x_i$, so that we have

$$P(x_1, y_1, x_2, y_2, \cdots, x_N, y_N | \alpha) = \prod_{i=1}^{N} \left[ P(x_i)\, P(y_i | x_i; \alpha) \right]. \tag{2.21}$$

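Further, with the simple assumptions above about the class of functions and
Gaussian noise, the conditional distribution of $y_i$ has the form

$$P(y_i | x_i; \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \left( y_i - \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x_i) \right)^2 \right]. \tag{2.22}$$

Putting all these factors together,

$$P(x_1, y_1, x_2, y_2, \cdots, x_N, y_N) = \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} y_i^2 \right]$$
$$\times \exp\left[ -\frac{N}{2} \sum_{\mu,\nu=1}^{K} A_{\mu\nu} \alpha_\mu \alpha_\nu + N \sum_{\mu=1}^{K} B_\mu \alpha_\mu \right], \tag{2.23}$$

where

$$A_{\mu\nu} = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} \phi_\mu(x_i) \phi_\nu(x_i), \quad \text{and} \tag{2.24}$$

$$B_\mu = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} y_i \phi_\mu(x_i). \tag{2.25}$$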



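Our placement of the factors of $N$ means that both $A_{\mu\nu}$ and $B_\mu$ are of order unity
as $N \to \infty$. These quantities are empirical averages over the samples $\{x_i, y_i\}$,
and if the $\phi_\mu$ are well behaved we expect that these empirical means converge to
expectation values for most realizations of the series $\{x_i\}$:

$$\lim_{N \to \infty} A_{\mu\nu}(\{x_i\}) = A^{\infty}_{\mu\nu} = \frac{1}{\sigma^2} \int dx\, P(x)\, \phi_\mu(x) \phi_\nu(x), \tag{2.26}$$

$$\lim_{N \to \infty} B_\mu(\{x_i, y_i\}) = B^{\infty}_\mu = \sum_{\nu=1}^{K} A^{\infty}_{\mu\nu} \bar{\alpha}_\nu, \tag{2.27}$$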



where $\bar{\alpha}$ are the parameters that actually gave rise to the data stream $\{x_i, y_i\}$. In
fact we can make the same argument about the terms in $\sum y_i^2$,

$$\lim_{N \to \infty} \sum_{i=1}^{N} y_i^2 = N \sigma^2 \left[ \sum_{\mu,\nu=1}^{K} \bar{\alpha}_\mu A^{\infty}_{\mu\nu} \bar{\alpha}_\nu + 1 \right]. \tag{2.28}$$

Conditions for this convergence of empirical means to expectation values are at
the heart of learning theory. Our approach here is first to assume that this
convergence works, then to examine the consequences for the predictive information,
and finally to address the conditions for and implications of this convergence
breaking down.

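Putting the different factors together, we obtain

$$P(x_1, y_1, x_2, y_2, \cdots, x_N, y_N) \to \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right], \tag{2.29}$$

where the effective "energy" per sample is given by

$$E_N(\alpha; \{x_i, y_i\}) = \frac{1}{2} + \frac{1}{2} \sum_{\mu,\nu=1}^{K} (\alpha_\mu - \bar{\alpha}_\mu) A^{\infty}_{\mu\nu} (\alpha_\nu - \bar{\alpha}_\nu). \tag{2.30}$$

Here we use the symbol $\to$ to indicate that we not only take the limit of large $N$,
but also neglect the fluctuations. Note that in this approximation the dependence
on the sample points themselves is hidden in the definition of $\bar{\alpha}$ as being the
parameters that generated the samples.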

The integral that we need to do in Eq. (2.29) involves an exponential with a
large factor $N$ in the exponent; the free energy $F_N$ is of order unity as $N \to \infty$.
This suggests that we evaluate the integral by a saddle point or steepest descent
approximation [similar analyses were performed by Clarke and Barron (1990), by
MacKay (1992), and by Balasubramanian (1997)]:

$$\int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right] \approx \mathcal{P}(\alpha_{\rm cl})$$
$$\times \exp\left[ -N E_N(\alpha_{\rm cl}; \{x_i, y_i\}) - \frac{K}{2} \ln \frac{N}{2\pi} - \frac{1}{2} \ln \det \mathcal{F}_N + \cdots \right], \tag{2.31}$$

where $\alpha_{\rm cl}$ is the "classical" value of $\alpha$ determined by the extremal conditions

$$\left. \frac{\partial E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu} \right|_{\alpha = \alpha_{\rm cl}} = 0, \tag{2.32}$$

the matrix $\mathcal{F}_N$ consists of the second derivatives of $E_N$,

$$\mathcal{F}_{N\,\mu\nu} = \left. \frac{\partial^2 E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu\, \partial \alpha_\nu} \right|_{\alpha = \alpha_{\rm cl}}, \tag{2.33}$$

and $\cdots$ denotes terms that vanish as $N \to \infty$. If we formulate the problem of
estimating the parameters $\alpha$ from the samples $\{x_i, y_i\}$, then as $N \to \infty$ the matrix
$N \mathcal{F}_N$ is the Fisher information matrix (Cover and Thomas 1991); the eigenvectors
of this matrix give the principal axes for the error ellipsoid in parameter space,
and the (inverse) eigenvalues give the variances of parameter estimates along
each of these directions. The classical $\alpha_{\rm cl}$ differs from $\bar{\alpha}$ only in terms of order
$1/N$; we neglect this difference and further simplify the calculation of leading
terms as $N$ becomes large. After a little more algebra, then, we find the
probability distribution we have been looking for:



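$$P(x_1, y_1, x_2, y_2, \cdots, x_N, y_N) \to \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A}\, \mathcal{P}(\bar{\alpha}) \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right], \tag{2.34}$$

where the normalization constant

$$Z_A = \sqrt{(2\pi)^{K} \det A^{\infty}}. \tag{2.35}$$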



Again we note that the sample points $\{x_i, y_i\}$ are hidden in the value of $\bar{\alpha}$ that
gave rise to these points.*

* We emphasize again that there are two approximations leading to Eq. (2.34). First, we have
replaced empirical means by expectation values, neglecting fluctuations associated with the
particular set of sample points $\{x_i, y_i\}$. Second, we have evaluated the average over parameters in a
saddle point approximation. At least under some conditions, both of these approximations would
become increasingly accurate as $N \to \infty$, so that this approach should yield the asymptotic
behavior of the distribution and hence the subextensive entropy at large $N$. Although we give a
more detailed analysis below, it is worth noting here how things can go wrong. The two
approximations are independent, and we could imagine that fluctuations are important but saddle
point integration still works, for example. Controlling the fluctuations turns out to be exactly the
question of whether our finite parameterization captures the true dimensionality of the class of
models, as discussed in the classic work of Vapnik, Chervonenkis, and others [see Vapnik (1998)
for a review]. The saddle point approximation can break down because the saddle point becomes
unstable or because multiple saddle points become important. It will turn out that instability is
exponentially improbable as $N \to \infty$, while multiple saddle points are a real problem in certain
classes of models, again when counting parameters doesn't really measure the complexity of the
model class.

To evaluate the entropy $S(N)$ we need to compute the expectation value of the
(negative) logarithm of the probability distribution in Eq. (2.34); there are three
terms. One is constant, so averaging is trivial. The second term depends only
on the $x_i$, and because these are chosen independently from the distribution $P(x)$
the average again is easy to evaluate. The third term involves $\bar{\alpha}$, and we need to
average this over the joint distribution $P(x_1, y_1, x_2, y_2, \cdots, x_N, y_N)$. As above, we
can evaluate this average in steps: first we choose a value of the parameters $\bar{\alpha}$,
then we average over the samples given these parameters, and finally we average
over parameters. But because $\bar{\alpha}$ is defined as the parameters that generate the
samples, this stepwise procedure simplifies enormously. The end result is that

$$S(N) = N \left[ S_x + \frac{1}{2} \log_2(2\pi e \sigma^2) \right] + \frac{K}{2} \log_2 N + S_\alpha + \langle \log_2 Z_A \rangle_\alpha + \cdots, \tag{2.36}$$

where $\langle \cdots \rangle_\alpha$ means averaging over parameters, $S_x$ is the entropy of the
distribution of $x$,

$$S_x = -\int dx\, P(x) \log_2 P(x), \tag{2.37}$$

and similarly for the entropy of the distribution of parameters,

$$S_\alpha = -\int d^K\alpha\, \mathcal{P}(\alpha) \log_2 \mathcal{P}(\alpha). \tag{2.38}$$
The different terms in the entropy Eq. (2.36) have a straightforward
interpretation. First we see that the extensive term in the entropy,

$$S_0 = S_x + \frac{1}{2} \log_2(2\pi e \sigma^2), \tag{2.39}$$

reflects contributions from the random choice of $x$ and from the Gaussian noise
in $y$; these extensive terms are independent of the variations in parameters $\alpha$,
and these would be the only terms if the parameters were not varying (that is,
if there were nothing to learn). There also is a term which reflects the entropy
of variations in the parameters themselves, $S_\alpha$. This entropy is not invariant
with respect to coordinate transformations in the parameter space, but the term
$\langle \log_2 Z_A \rangle_\alpha$ compensates for this noninvariance. Finally, and most interestingly for
our purposes, the subextensive piece of the entropy is dominated by a
logarithmic divergence,

$$S_1(N) \to \frac{K}{2} \log_2 N \quad \text{(bits)}. \tag{2.40}$$

The coefficient of this divergence counts the number of parameters independent
of the coordinate system that we choose in the parameter space. Furthermore,
this result does not depend on the set of basis functions $\{\phi_\mu(x)\}$. This is a hint
that the result in Eq. (2.40) is more universal than our simple example.
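
The logarithmic divergence of Eq. (2.40) can be checked in one special case where the entropy is available in closed form. The sketch below rests on assumptions that are ours and not part of the derivation above: the prior is Gaussian with independent components of variance `prior_var`, and the basis functions are such that the empirical Gram matrix $\sum_i \phi_\mu(x_i)\phi_\nu(x_i)$ equals exactly $N$ times the identity, so that fluctuations are absent. Given the $x$'s, the $y$'s are then jointly Gaussian, and the part of their entropy beyond the extensive noise term is $(K/2)\log_2(1 + N\,\mathtt{prior\_var}/\mathtt{noise\_var})$.

```python
import math

def subextensive_entropy_bits(n_samples, n_params, prior_var=1.0, noise_var=1.0):
    """Subextensive entropy (bits) of the Gaussian linear model under the
    stated assumptions: independent Gaussian prior on each parameter and an
    empirical Gram matrix equal to n_samples times the identity."""
    return 0.5 * n_params * math.log2(1.0 + n_samples * prior_var / noise_var)

K = 3  # hypothetical number of parameters, chosen only for illustration
for n in (10, 100, 1000, 10000, 100000):
    s1 = subextensive_entropy_bits(n, K)
    # The ratio to log2(N) approaches K/2 as N grows.
    print(n, round(s1, 4), round(s1 / math.log2(n), 4))

# Doubling the number of samples adds close to K/2 bits at large N.
print(subextensive_entropy_bits(2_000_000, K) - subextensive_entropy_bits(1_000_000, K))
```

Changing `prior_var` or `noise_var` in this sketch shifts the curve by a constant at large $N$ but leaves the coefficient $K/2$ of the logarithm untouched, in line with the remark that the coefficient counts parameters.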

2.4.2 Learning a parameterized distribution 

The problem discussed above is an example of supervised learning: we are given
examples of how the points $x_n$ map into $y_n$, and from these examples we are to
induce the association or functional relation between $x$ and $y$. An alternative view is
that each pair of points $(x, y)$ should be viewed as a vector $\vec{x}$, and what we are learning
is the distribution of this vector. The problem of learning a distribution usually
is called unsupervised learning, but in this case supervised learning formally is a
special case of unsupervised learning; if we admit that all the functional relations
or associations that we are trying to learn have an element of noise or stochasticity,
then this connection between supervised and unsupervised problems is quite
general.

Suppose a series of random vector variables $\{\vec{x}_i\}$ are drawn independently
from the same probability distribution $Q(\vec{x}|\alpha)$, and this distribution depends on
a (potentially infinite dimensional) vector of parameters $\alpha$. As above, the
parameters are unknown, and before the series starts they are chosen randomly from
a distribution $\mathcal{P}(\alpha)$. With no constraints on the densities $\mathcal{P}(\alpha)$ or $Q(\vec{x}|\alpha)$ it is
impossible to derive any regression formulas for parameter estimation, but one
can still calculate the leading terms in the entropy of the data series and thus the
predictive information.

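We begin with the definition of entropy

$$S(N) = S[\{\vec{x}_i\}] = -\int d\vec{x}_1 \cdots d\vec{x}_N\, P(\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_N). \tag{2.41}$$

By analogy with Eq. (2.20) we then write

$$P(\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_N) = \int d^K\alpha\, \mathcal{P}(\alpha) \prod_{i=1}^{N} Q(\vec{x}_i | \alpha). \tag{2.42}$$

Next, combining the last two equations and rearranging the order of integration,
we can rewrite $S(N)$ as

$$S(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \log_2 P(\{\vec{x}_i\}) \right\}. \tag{2.43}$$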

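Eq. (2.43) allows an easy interpretation. There is the 'true' set of parameters
$\bar{\alpha}$ that gave rise to the data sequence $\vec{x}_1 \cdots \vec{x}_N$ with the probability $\prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha})$.
We need to average $\log_2 P(\vec{x}_1 \cdots \vec{x}_N)$ first over all possible realizations of the data
keeping the true parameters fixed, and then over the parameters $\bar{\alpha}$ themselves.
With this interpretation in mind, the joint probability density, the logarithm of
which is being averaged, can be rewritten in the following useful way:

$$P(\vec{x}_1, \cdots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right]$$
$$= \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N \mathcal{E}_N(\alpha; \{\vec{x}_i\}) \right], \tag{2.44}$$

$$\mathcal{E}_N(\alpha; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right]. \tag{2.45}$$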

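Since, by our interpretation, $\bar{\alpha}$ are the true parameters that gave rise to the
particular data $\{\vec{x}_i\}$, we may expect empirical means to converge to expectation values,
so that

$$\mathcal{E}_N(\alpha; \{\vec{x}_i\}) = -\int d^D x\, Q(\vec{x} | \bar{\alpha}) \ln \left[ \frac{Q(\vec{x} | \alpha)}{Q(\vec{x} | \bar{\alpha})} \right] - \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\}), \tag{2.46}$$

where $\psi \to 0$ as $N \to \infty$; here we neglect $\psi$, and return to this term below.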
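
The convergence of the empirical "energy" of Eq. (2.45) to its expectation value in Eq. (2.46) can be illustrated with a small simulation. The sketch below assumes, purely for illustration, a one-parameter Bernoulli family with true parameter `p_true` and trial parameter `p_model`; the family, the parameter values, and the function names are hypothetical and not taken from the text. The difference between the two printed quantities plays the role of the fluctuation term $\psi$.

```python
import math
import random

def bernoulli_kl(p_true, p_model):
    """Kullback-Leibler divergence (nats) from Bernoulli(p_true) to Bernoulli(p_model)."""
    return (p_true * math.log(p_true / p_model)
            + (1.0 - p_true) * math.log((1.0 - p_true) / (1.0 - p_model)))

def empirical_energy(samples, p_true, p_model):
    """Minus the mean log-likelihood ratio of the trial model against the
    true model, following the form of Eq. (2.45)."""
    total = 0.0
    for x in samples:
        q_model = p_model if x == 1 else 1.0 - p_model
        q_true = p_true if x == 1 else 1.0 - p_true
        total += math.log(q_model / q_true)
    return -total / len(samples)

rng = random.Random(0)  # fixed seed so the run is reproducible
p_true, p_model = 0.3, 0.5
for n in (100, 10_000, 100_000):
    samples = [1 if rng.random() < p_true else 0 for _ in range(n)]
    energy = empirical_energy(samples, p_true, p_model)
    # The gap between the two numbers shrinks roughly like 1/sqrt(n).
    print(n, energy, bernoulli_kl(p_true, p_model))
```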

The first term on the right hand side of Eq. (2.46) is the Kullback-Leibler
divergence, $D_{\rm KL}(\bar{\alpha} \| \alpha)$, between the true distribution characterized by parameters
$\bar{\alpha}$ and the possible distribution characterized by $\alpha$. Thus at large $N$ we have

$$P(\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_N) \to \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right], \tag{2.47}$$

where again the notation $\to$ reminds us that we are not only taking the limit of
large $N$ but also making another approximation in neglecting fluctuations. By
the same arguments as above we can proceed (formally) to compute the entropy
of this distribution, and we find

$$S(N) \approx S_0 \cdot N + S_1^{(a)}(N), \tag{2.48}$$

$$S_0 = \int d^K\alpha\, \mathcal{P}(\alpha) \left[ -\int d^D x\, Q(\vec{x} | \alpha) \log_2 Q(\vec{x} | \alpha) \right], \quad \text{and} \tag{2.49}$$

$$S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \log_2 \left[ \int d^K\alpha\, \mathcal{P}(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha} \| \alpha)} \right]. \tag{2.50}$$

Here $S_1^{(a)}$ is an approximation to $S_1$ that neglects fluctuations $\psi$. This is the same
as the annealed approximation in the statistical mechanics of disordered systems,
as has been used widely in the study of supervised learning problems (Seung et
al. 1992). Thus we can identify the data sequence $\vec{x}_1 \cdots \vec{x}_N$ with the disorder,
$\mathcal{E}_N(\alpha; \{\vec{x}_i\})$ with the energy of the quenched system, and $D_{\rm KL}(\bar{\alpha} \| \alpha)$ with its
annealed analogue.

The extensive term $S_0$, Eq. (2.49), is the average entropy of a distribution in
our family of possible distributions, generalizing the result of Eq. (2.39). The
subextensive terms in the entropy are controlled by the $N$ dependence of the
partition function

$$Z(\bar{\alpha}; N) = \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right], \tag{2.51}$$

and $S_1^{(a)}(N) = -\langle \log_2 Z(\bar{\alpha}; N) \rangle_{\bar{\alpha}}$ is analogous to the free energy. Since what is
important in this integral is the Kullback-Leibler (KL) divergence between
different distributions, it is natural to ask about the density of models that are KL
divergence $D$ away from the target $\bar{\alpha}$,

$$\rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\left[ D - D_{\rm KL}(\bar{\alpha} \| \alpha) \right]; \tag{2.52}$$

note that this density could be very different for different targets. The density
of divergences is normalized because the original distribution over parameter
space, $\mathcal{P}(\alpha)$, is normalized,

$$\int dD\, \rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha) = 1. \tag{2.53}$$

Finally, the partition function takes the simple form

$$Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-N D]. \tag{2.54}$$

We recall that in statistical mechanics the partition function is given by

$$Z(\beta) = \int dE\, \rho(E) \exp[-\beta E], \tag{2.55}$$

where $\rho(E)$ is the density of states that have energy $E$, and $\beta$ is the inverse
temperature. Thus the subextensive entropy in our learning problem is analogous to
a system in which energy corresponds to the Kullback-Leibler divergence
relative to the target model, and temperature is inverse to the number of examples.
As we increase the length $N$ of the time series we have observed, we "cool" the
system and hence probe models which approach the target; the dynamics of this
approach is determined by the density of low energy states, that is the behavior
of $\rho(D; \bar{\alpha})$ as $D \to 0$.

The structure of the partition function is determined by a competition be- 
tween the (Boltzmann) exponential term, which favors models with small D, and 
the density term, which favors values of D that can be achieved by the largest 
possible number of models. Because there (typically) are many parameters, there
are very few models with $D \to 0$. This picture of competition between the
Boltzmann factor and a density of states has been emphasized in previous work on
supervised learning (Haussler et al. 1996). 

The behavior of the density of states, $\rho(D; \bar{\alpha})$, at small $D$ is related to the more
intuitive notion of dimensionality. In a parameterized family of distributions, the
Kullback-Leibler divergence between two distributions with nearby parameters
is approximately a quadratic form,

$$D_{\rm KL}(\bar{\alpha} \| \alpha) \approx \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu)\, \mathcal{F}_{\mu\nu}\, (\bar{\alpha}_\nu - \alpha_\nu) + \cdots, \tag{2.56}$$

where $\mathcal{F}$ is the Fisher information matrix. Intuitively, if we have a reasonable
parameterization of the distributions, then similar distributions will be nearby
in parameter space, and more importantly points that are far apart in parameter
space will never correspond to similar distributions; Clarke and Barron (1990)
refer to this condition as the parameterization forming a "sound" family of
distributions. If this condition is obeyed, then we can approximate the low $D$ limit
of the density $\rho(D; \bar{\alpha})$:

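$$\rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\left[ D - D_{\rm KL}(\bar{\alpha} \| \alpha) \right]$$
$$\approx \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\left[ D - \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu)\, \mathcal{F}_{\mu\nu}\, (\bar{\alpha}_\nu - \alpha_\nu) \right]$$
$$= \int d^K\xi\, \mathcal{P}(\bar{\alpha} + \mathcal{U} \cdot \xi)\, \delta\left[ D - \frac{1}{2} \sum_{\mu} \Lambda_\mu \xi_\mu^2 \right], \tag{2.57}$$

where $\mathcal{U}$ is a matrix that diagonalizes $\mathcal{F}$,

$$(\mathcal{U}^T \cdot \mathcal{F} \cdot \mathcal{U})_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}. \tag{2.58}$$

The delta function restricts the components of $\xi$ in Eq. (2.57) to be of order $\sqrt{D}$ or
less, and so if $\mathcal{P}(\alpha)$ is smooth we can make a perturbation expansion. After some
algebra the leading term becomes

$$\rho(D \to 0; \bar{\alpha}) \approx \mathcal{P}(\bar{\alpha})\, \frac{(2\pi)^{K/2}}{\Gamma(K/2)}\, (\det \mathcal{F})^{-1/2}\, D^{(K-2)/2}. \tag{2.59}$$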

Here, as before, $K$ is the dimensionality of the parameter vector. Computing the
partition function from Eq. (2.54), we find

$$Z(\bar{\alpha}; N \to \infty) \approx f(\bar{\alpha}) \cdot \frac{\Gamma(K/2)}{N^{K/2}}, \tag{2.60}$$

where $f(\bar{\alpha})$ is some function of the target parameter values. Finally, this allows
us to evaluate the subextensive entropy, from Eqs. (2.50, 2.51):



$$S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \log_2 Z(\bar{\alpha}; N) \tag{2.61}$$
$$\to \frac{K}{2} \log_2 N + \cdots \quad \text{(bits)}, \tag{2.62}$$

where $\cdots$ are finite as $N \to \infty$. Thus, general $K$-parameter model classes have
the same subextensive entropy as for the simplest example considered in the
previous section. To the leading order, this result is independent even of the prior
distribution $\mathcal{P}(\alpha)$ on the parameter space, so that the predictive information
seems to count the number of parameters under some very general conditions
[cf. Fig. (2.3) for a numerical example of the logarithmic behavior].
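
The step from the small $D$ behavior of the density to the $(K/2)\log_2 N$ law can be verified by direct quadrature. The sketch below is an illustration under stated assumptions: it takes $\rho(D) = D^{K/2-1}$ on $0 < D < D_{\max}$ with all prefactors dropped, and the cutoff `d_max`, the step count, and the function names are arbitrary choices of ours. It compares a midpoint-rule estimate of $Z(N)$ with $\Gamma(K/2)/N^{K/2}$ and reports how much $-\log_2 Z$ grows when $N$ doubles.

```python
import math

def partition_function(n_samples, n_params, d_max=1.0, steps=20000):
    """Midpoint-rule estimate of Z(N) = int_0^d_max dD D^(K/2 - 1) exp(-N D).

    The substitution D = u**2 turns the integrand into
    2 u^(K-1) exp(-N u^2), which removes the integrable singularity
    at D = 0 that appears for K = 1."""
    u_max = math.sqrt(d_max)
    du = u_max / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * du
        total += 2.0 * u ** (n_params - 1) * math.exp(-n_samples * u * u)
    return total * du

def annealed_entropy_bits(n_samples, n_params):
    """Minus log2 of the partition function, up to N-independent constants."""
    return -math.log2(partition_function(n_samples, n_params))

for k in (1, 2, 5):
    growth = annealed_entropy_bits(2000, k) - annealed_entropy_bits(1000, k)
    closed_form = math.gamma(k / 2) / 1000 ** (k / 2)
    # growth per doubling of N should be close to K/2 bits
    print(k, growth, partition_function(1000, k), closed_form)
```

The cutoff matters little here because the Boltzmann factor suppresses everything beyond $D \sim 1/N$; only the exponent of the density near $D = 0$ sets the coefficient of the logarithm.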

Although Eq. (2.62) is true under a wide range of conditions, this cannot be
the whole story. Much of modern learning theory is concerned with the fact
that counting parameters is not quite enough to characterize the complexity of
a model class; the naive dimension of the parameter space $K$ should be viewed
in conjunction with the Vapnik-Chervonenkis (VC) dimension $d_{\rm VC}$ (also known
as the pseudodimension) and the phase space dimension $d$. The phase space
dimension is defined in the usual way through the scaling of volumes in the model
space (see, for example, Opper 1994). On the other hand, $d_{\rm VC}$ measures not
volumes, but capacity of the model class, and its definition is a bit trickier: for a set
of binary (indicator) functions $F(\vec{x}, \alpha)$, VC dimension is defined as the maximal
number of vectors $\vec{x}_1 \cdots \vec{x}_{d_{\rm VC}}$ that can be classified into two different classes in
all $2^{d_{\rm VC}}$ possible ways using this set of functions. Similarly, for real-valued
functions $F(\vec{x}, \alpha)$ one can first define a complete set of indicators using step functions,
$\theta[F(\vec{x}, \alpha) - \beta]$, and then the VC dimension of this set is the VC dimension of the
real-valued functions (Vapnik 1998). Separation of a set of vectors in all possible ways is
called shattering, and hence another name for the VC dimension: the shattering
dimension.
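
The definition of shattering can be made concrete with a brute-force check on two toy classes of indicator functions. Both classes below (thresholds on the line and open intervals on the line, with parameters restricted to a finite grid) are hypothetical examples of ours rather than anything discussed in the text, and a grid search of this kind is only a sanity check, not a proof of a VC dimension.

```python
from itertools import product

def shatters(points, classifiers):
    """True if the classifiers realize all 2^n labelings of the given points."""
    realized = {tuple(clf(x) for x in points) for clf in classifiers}
    return len(realized) == 2 ** len(points)

def threshold_family(grid):
    """Step-function indicators theta[x - b], one per threshold b on the grid."""
    return [lambda x, b=b: int(x > b) for b in grid]

def interval_family(grid):
    """Indicators of open intervals (a, b) with a < b, both endpoints on the grid."""
    return [lambda x, a=a, b=b: int(a < x < b)
            for a, b in product(grid, repeat=2) if a < b]

grid = [i / 10 for i in range(-10, 21)]

# Thresholds (one parameter): a single point is shattered, two points are not,
# because the left point can never be labeled 1 while the right one is labeled 0.
print(shatters([0.55], threshold_family(grid)))
print(shatters([0.25, 0.75], threshold_family(grid)))

# Intervals (two parameters): two points are shattered, three are not,
# because the labeling (1, 0, 1) cannot be produced by a single interval.
print(shatters([0.25, 0.75], interval_family(grid)))
print(shatters([0.25, 0.55, 0.75], interval_family(grid)))
```

In these two toy classes the largest shattered set has as many points as there are parameters; the discussion that follows concerns cases where such agreement fails.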

Both $d$ and $d_{\rm VC}$ can differ from the number of parameters in several ways. One
possibility is that $d_{\rm VC}$ is infinite when the number of parameters is finite, a
problem discussed below. Another possibility is that the determinant of $\mathcal{F}$ is zero,
and hence $d_{\rm VC}$ and $d$ are both smaller than the number of parameters because we
have adopted a redundant description. It is possible that this sort of degeneracy
occurs over a finite fraction but not all of the parameter space, and this is one way
to generate an effective fractional dimensionality. One can imagine multifractal
models such that the effective dimensionality varies continuously over the
parameter space, but it is not obvious where this would be relevant. Finally, models
with $d < d_{\rm VC} < \infty$ are also possible [see, for example, Opper (1994)], and this list
probably is not exhaustive.

The calculation above, Eq. (2.59), lets us actually define the phase space
dimension through the exponent in the small $D_{\rm KL}$ behavior of the model density,

$$\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2}, \tag{2.63}$$

and then $d$ appears in place of $K$ as the coefficient of the log divergence in $S_1(N)$
(Clarke and Barron 1990, Opper 1994). However, this simple conclusion can fail
in two ways. First, it can happen that a macroscopic weight gets accumulated
at some nonzero value of $D_{\rm KL}$, so that the small $D_{\rm KL}$ behavior is irrelevant for
the large $N$ asymptotics. Second, the fluctuations neglected here may be
uncontrollably large, so that the asymptotics are never reached. Since controllability of
fluctuations is a function of $d_{\rm VC}$ (see Vapnik 1998 and later in this paper), we may
summarize this in the following way. Provided that the small $D_{\rm KL}$ behavior of the
density function is the relevant one, the coefficient of the logarithmic divergence
of $I_{\rm pred}$ measures the phase space or the scaling dimension $d$ and nothing else.
This asymptote is valid, however, only for $N \gg d_{\rm VC}$. It is still an open question
whether the two pathologies that can violate this asymptotic behavior are related.

2.4.3 Learning a parameterized process 

Consider a process where samples are not independent, and our task is to learn
their joint distribution $Q(\vec{x}_1, \cdots, \vec{x}_N | \alpha)$. Again, $\alpha$ is an unknown parameter vector
which is chosen randomly at the beginning of the series. If $\alpha$ is a $K$ dimensional
vector, then one still tries to learn just $K$ numbers and there are still $N$ examples,
even if there are correlations. Therefore, although such problems are much more
general than those considered above, it is reasonable to expect that the predictive
information is still measured by $(K/2) \log_2 N$ provided that some conditions are
met.

One might suppose that conditions for simple results on the predictive
information are very strong, for example that the distribution $Q$ is a finite order
Markov model. In fact all we really need are the following two conditions:

$$S[\{\vec{x}_i\} | \alpha] = -\int d^N\vec{x}\, Q(\{\vec{x}_i\} | \alpha) \log_2 Q(\{\vec{x}_i\} | \alpha) \to N S_0 + S_0^*; \qquad S_0^* = O(1), \tag{2.64}$$

$$D_{\rm KL}\left[ Q(\{\vec{x}_i\} | \bar{\alpha}) \,\|\, Q(\{\vec{x}_i\} | \alpha) \right] \to N \mathcal{D}_{\rm KL}(\bar{\alpha} \| \alpha) + o(N). \tag{2.65}$$

Here the quantities $S_0$, $S_0^*$, and $\mathcal{D}_{\rm KL}(\bar{\alpha} \| \alpha)$ are defined by taking limits $N \to \infty$
in both equations. The first of the constraints limits deviations from extensivity
to be of order unity, so that if $\alpha$ is known there are no long range correlations
in the data; all of the long range predictability is associated with learning the
parameters.* The second constraint, Eq. (2.65), is a less restrictive one, and it
ensures that the "energy" of our statistical system is an extensive quantity.

* Suppose that we observe a Gaussian stochastic process and we try to learn the power
spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational
spectra) then this condition is met. On the other hand, if the class of possible spectra includes $1/f$
noise, then the condition may not be met. For more on long range correlations, see below.

With these conditions it is straightforward to show that the results of the
previous subsection carry over virtually unchanged. With the same cautious
statements about fluctuations and the distinction between $K$, $d$, and $d_{\rm VC}$, one arrives
at the result:

$$S(N) = S_0 \cdot N + S_1^{(a)}(N), \tag{2.66}$$

$$S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \quad \text{(bits)}, \tag{2.67}$$

where $\cdots$ stands for terms of order one. Note again that for the result Eq. (2.67)
to be valid, the process considered is not required to be a finite order Markov
process. Memory of all previous outcomes may be kept, provided that the
accumulated memory does not contribute a divergent term to the subextensive entropy.

It is interesting to ask what happens if the condition in Eq. (2.64) is
violated, so that there are long range correlations even in the conditional distribution
$Q(\vec{x}_1, \cdots, \vec{x}_N | \alpha)$. Suppose, for example, that $S_0^* = (K^*/2) \log_2 N$. Then the
subextensive entropy becomes

$$S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \quad \text{(bits)}. \tag{2.68}$$

We see that the subextensive entropy makes no distinction between
predictability that comes from unknown parameters and predictability that comes from
intrinsic correlations in the data; in this sense, two models with the same $K + K^*$
are equivalent. This, actually, must be so. As an example, consider a chain of
Ising spins with long range interactions in one dimension. This system can order
(magnetize) and exhibit long range correlations, and so the predictive
information will diverge at the transition to ordering. In one view, there is no global
parameter analogous to $\alpha$, just the long range interactions. On the other hand, there
are regimes in which we can approximate the effect of these interactions by
saying that all the spins experience a mean field which is constant across the whole
length of the system, and then formally we can think of the predictive
information as being carried by the mean field itself. In fact there are situations in which
this is not just an approximation, but an exact statement. Thus we can trade a
description in terms of long range interactions ($K^* \neq 0$, but $K = 0$) for one in which
there are unknown parameters describing the system but given these parameters
there are no long range correlations ($K \neq 0$, $K^* = 0$). The two descriptions are
equivalent, and this is captured by the subextensive entropy.†

† There are a number of interesting questions about how the coefficients in the diverging
predictive information relate to the usual critical exponents, and we hope to return to this problem
in a later paper.

2.4.4 Taming the fluctuations: finite $d_{\rm VC}$ case

The preceding calculations of the subextensive entropy $S_1$ are worthless unless
we prove that the fluctuations $\psi$ are controllable. In this subsection we are going
to discuss when and if this, indeed, happens. We limit the discussion to the
analysis of fluctuations in the case of finding a probability density (Section 2.4.2); the
case of learning a process (Section 2.4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not
make a separation into the annealed and the fluctuation term, and the quantity
they were interested in was a bit different from ours, but, interpreting loosely,
they proved that, modulo some reasonable technical assumptions on
differentiability of functions in question, the fluctuation term always approaches zero.
However, they did not investigate the speed of this approach, and we believe
that, by doing so, they missed some important qualitative distinctions between
different problems that can arise due to a difference between $d$ and $d_{\rm VC}$. In
order to illuminate these distinctions, we here go through the trouble of analyzing
fluctuations all over again.

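Returning to Eqs. (2.44, 2.46) and the definition of entropy, we can write the
entropy $S(N)$ exactly as

$$
S(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha})
       \int \prod_{j=1}^{N} \left[ dx_j\, Q(x_j|\bar{\alpha}) \right]
       \times \log_2 \left[ \prod_{i=1}^{N} Q(x_i|\bar{\alpha})
       \int d^K\alpha\, \mathcal{P}(\alpha)\,
       e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{x_i\})} \right].
\tag{2.69}
$$

This expression can be decomposed into the terms identified above, plus a new
contribution to the subextensive entropy that comes from the fluctuations alone,
$S_1^{(f)}(N)$:

$$
S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),
\tag{2.70}
$$

$$
S_1^{(f)}(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha})
       \prod_{j=1}^{N} \left[ dx_j\, Q(x_j|\bar{\alpha}) \right]
       \times \log_2 \left[ \int \frac{d^K\alpha\, \mathcal{P}(\alpha)}{Z(\bar{\alpha};N)}\,
       e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{x_i\})} \right],
\tag{2.71}
$$

where $\psi$ is defined as in Eq. (2.46), and $Z$ as in Eq. (2.51).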

Some loose but useful bounds can be established. First, the predictive
information is a positive (semidefinite) quantity, and so the fluctuation term
may not be smaller than the value of $-S_1^{(a)}$ as calculated in Eqs. (2.62,
2.67). Second, since fluctuations make it more difficult to generalize from
samples, the predictive information should always be reduced by fluctuations,
so that $S_1^{(f)}$ is negative. This last statement corresponds to the fact
that for the statistical mechanics of disordered systems, the annealed free
energy always is less than the average quenched free energy, and may be proven
rigorously by applying Jensen's inequality to the (concave) logarithm function
in Eq. (2.71); essentially the same argument was given by Opper and Haussler
(1995). A related Jensen's inequality argument allows us to show that the total
$S_1(N)$ is bounded,

$$
S_1(N) \le N \int d^K\alpha \int d^K\bar{\alpha}\,
       \mathcal{P}(\alpha)\, \mathcal{P}(\bar{\alpha})\, D_{\rm KL}(\bar{\alpha}\|\alpha)
       \equiv N \left\langle D_{\rm KL}(\bar{\alpha}\|\alpha) \right\rangle_{\bar{\alpha},\alpha},
\tag{2.72}
$$

so that if we have a class of models (and a prior $\mathcal{P}(\alpha)$) such
that the average Kullback-Leibler divergence among pairs of models is finite,
then the subextensive entropy is necessarily properly defined. Note that
$\langle D_{\rm KL}(\bar{\alpha}\|\alpha) \rangle_{\bar{\alpha},\alpha}$ includes
$S_0$ as one of its terms, so that usually $S_0$ and $S_1$ are well- or
ill-defined together.
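
As a quick numerical illustration of the condition behind Eq. (2.72), consider a
toy family that is our assumption here and is not part of the analysis above:
unit-variance Gaussians with unknown mean, with a Gaussian prior of width
`sigma` on that mean. For this family the KL divergence between two models is
half the squared difference of their means (in nats), so the prior-averaged
divergence equals `sigma**2` and is finite. The sketch below estimates that
average by Monte Carlo; all function names are illustrative.

```python
import random


def kl_gauss_location(alpha_bar: float, alpha: float) -> float:
    """KL divergence (nats) between unit-variance Gaussians with the given means."""
    return 0.5 * (alpha_bar - alpha) ** 2


def average_pairwise_kl(sigma: float, n_pairs: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of <D_KL(alpha_bar || alpha)> under a N(0, sigma^2) prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        alpha_bar = rng.gauss(0.0, sigma)  # target model drawn from the prior
        alpha = rng.gauss(0.0, sigma)      # competing model drawn from the prior
        total += kl_gauss_location(alpha_bar, alpha)
    return total / n_pairs


sigma = 2.0  # hypothetical prior width
estimate = average_pairwise_kl(sigma)
print(f"Monte Carlo <D_KL> = {estimate:.3f} nats; analytic value sigma^2 = {sigma ** 2:.3f}")
```

For a prior of this kind the right-hand side of Eq. (2.72) is finite for every
$N$, which is the sense in which the subextensive entropy is properly defined.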

Tighter bounds require nontrivial assumptions about the classes of
distributions considered. The fluctuation term would be zero if $\psi$ were
zero, and $\psi$ is the difference between an expectation value (KL divergence)
and the corresponding empirical mean. There is a broad literature that deals
with this type of difference (see, for example, Vapnik 1998).

We start with the case when the pseudo-dimension ($d_{\rm VC}$) of the set of
probability densities $\{Q(x|\alpha)\}$ is finite. Then for any reasonable
function $F(x;\beta)$, deviations of the empirical mean from the expectation
value can be bounded by probabilistic bounds of the form

$$
P\left\{ \sup_{\beta}
   \frac{\left| \frac{1}{N}\sum_j F(x_j;\beta)
   - \int dx\, Q(x|\bar{\alpha})\, F(x;\beta) \right|}{L[F]} > \epsilon \right\}
   < M(\epsilon, N, d_{\rm VC})\, e^{-cN\epsilon^2},
\tag{2.73}
$$

where $c$ and $L[F]$ depend on the details of the particular bound used.
Typically, $c$ is a constant of order one, and $L[F]$ is either some moment of
$F$ or the range of its variation. In our case, $F$ is the log-ratio of two
densities, so that $L[F]$ may be assumed bounded for almost all $\beta$ without
loss of generality in view of Eq. (2.72). In addition, $M(\epsilon, N, d_{\rm VC})$
is finite at zero, grows at most subexponentially in its first two arguments,
and depends exponentially on $d_{\rm VC}$. Bounds of this form may have
different names in different contexts: Glivenko-Cantelli, Vapnik-Chervonenkis,
Hoeffding, Chernoff, ...; for review see Vapnik (1998) and the references
therein.
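
To make the form of Eq. (2.73) concrete, the following minimal sketch checks
the simplest member of this family, Hoeffding's inequality, for a single
function rather than a supremum over $\beta$. The setup is our assumption:
samples are uniform on $[0, 1]$, $F(x) = x$, so that $L[F] = 1$ (the range of
$F$), $M = 2$ and $c = 2$. The code compares the observed frequency of large
deviations of the empirical mean with the bound $2 e^{-2N\epsilon^2}$.

```python
import math
import random


def deviation_frequency(n_samples: int, eps: float, n_trials: int = 2000, seed: int = 1) -> float:
    """Fraction of trials in which |empirical mean - 1/2| exceeds eps."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_trials):
        mean = sum(rng.random() for _ in range(n_samples)) / n_samples
        if abs(mean - 0.5) > eps:
            exceed += 1
    return exceed / n_trials


def hoeffding_bound(n_samples: int, eps: float) -> float:
    """Hoeffding bound 2 exp(-2 N eps^2) for variables confined to [0, 1]."""
    return 2.0 * math.exp(-2.0 * n_samples * eps ** 2)


EPS = 0.1  # hypothetical deviation threshold
results = []
for n in (25, 100, 400):
    freq = deviation_frequency(n, EPS)
    bound = hoeffding_bound(n, EPS)
    results.append((n, freq, bound))
    print(f"N = {n:4d}: observed frequency = {freq:.4f}, bound = {bound:.4f}")
```

The bound is loose at small $N$ (it can exceed one), but its exponential decay
in $N\epsilon^2$ is the feature used in the argument that follows.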

To start the proof of finiteness of $S_1^{(f)}$ in this case, we first show
that only the region $\alpha \approx \bar{\alpha}$ is important when calculating
the inner integral in Eq. (2.71). This statement is equivalent to saying that at
large values of $\alpha - \bar{\alpha}$ the KL divergence almost always
dominates the fluctuation term, that is, the contribution of sequences of
$\{x_i\}$ with atypically large fluctuations is negligible (atypicality is
defined as $\psi \ge \delta$, where $\delta$ is some small constant independent
of $N$). Since the fluctuations decrease as $1/\sqrt{N}$ [see Eq. (2.73)], and
$D_{\rm KL}$ is of order one, this is plausible. To show this, we bound the
logarithm in Eq. (2.71) by $N$ times the supremum value of $\psi$. Then we
realize that the averaging over $\bar{\alpha}$ and $\{x_i\}$ is equivalent to
integration over all possible values of the fluctuations. The worst case density
of the fluctuations may be estimated by differentiating Eq. (2.73) with respect
to $\epsilon$ (this brings down an extra factor of $N\epsilon$). Thus the worst
case contribution of these atypical sequences is

$$
S_1^{(f),\,{\rm atypical}} \sim \int_{\delta}^{\infty} d\epsilon\,
   N^2 \epsilon^2 M(\epsilon)\, e^{-cN\epsilon^2}
   \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N.
\tag{2.74}
$$

This bound lets us focus our attention on the region $\alpha \approx \bar{\alpha}$.
We expand the exponent of the integrand of Eq. (2.71) around this point and
perform a simple Gaussian integration. In principle, large fluctuations might
lead to an instability (positive or zero curvature) at the saddle point, but
this is atypical and therefore is accounted for already. Curvatures at the
saddle points of both numerator and denominator are of the same order, and
throwing away unimportant additive and multiplicative constants of order unity,
we obtain the following result for the contribution of typical sequences:

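$$
S_1^{(f),\,{\rm typical}} \sim \int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha})\,
   d^N x \prod_j Q(x_j|\bar{\alpha})\; N \left( B\, A^{-1} B \right);
\tag{2.75}
$$

$$
B_\mu = \frac{1}{N} \sum_i
   \frac{\partial \log Q(x_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu},
   \quad \langle B \rangle_x = 0; \qquad
(A)_{\mu\nu} = -\frac{1}{N} \sum_i
   \frac{\partial^2 \log Q(x_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu\, \partial \bar{\alpha}_\nu},
   \quad \langle A \rangle_x = \mathcal{F}.
$$

Here $\langle \cdots \rangle_x$ means an averaging with respect to all $x_i$'s
keeping $\bar{\alpha}$ constant. One immediately recognizes that $B$ and $A$
are, respectively, first and second derivatives of the empirical KL divergence
that was in the exponent of the inner integral in Eq. (2.71).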

We are dealing now with typical cases. Therefore, large deviations of $A$ from
$\mathcal{F}$ are not allowed, and we may bound Eq. (2.75) by replacing
$A^{-1}$ with $\mathcal{F}^{-1}(1+\delta)$, where $\delta$ again is independent
of $N$. Now we have to average a bunch of products like

$$
\frac{\partial \log Q(x_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}\,
(\mathcal{F}^{-1})_{\mu\nu}\,
\frac{\partial \log Q(x_j|\bar{\alpha})}{\partial \bar{\alpha}_\nu}
$$

over all $x_i$'s. Only the terms with $i = j$ survive the averaging. There are
$K^2 N$ such terms, each contributing of order $N^{-1}$. This means that the
total contribution of the typical fluctuations is bounded by a number of order
one and does not grow with $N$. This concludes the proof of controllability of
fluctuations for $d_{\rm VC} < \infty$.
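
The statement that typical fluctuations contribute a number of order one,
independent of $N$, can be checked numerically in the simplest setting. The
sketch below is an illustration under our own assumptions, not part of the
proof: a one-parameter ($K = 1$) family of unit-variance Gaussians with mean
$\bar{\alpha}$, for which $B$ is the sample mean of $x_i - \bar{\alpha}$ and
the Fisher information is $\mathcal{F} = 1$. In that case
$N B \mathcal{F}^{-1} B$ is chi-square distributed with one degree of freedom,
so its average is 1 for every $N$.

```python
import random


def scaled_fluctuation(n_samples: int, alpha_bar: float, rng: random.Random) -> float:
    """N * B * F^{-1} * B for one data set drawn from a unit-variance Gaussian."""
    # B = (1/N) sum_i d log Q(x_i|alpha_bar) / d alpha_bar = mean of (x_i - alpha_bar)
    b = sum(rng.gauss(alpha_bar, 1.0) - alpha_bar for _ in range(n_samples)) / n_samples
    fisher = 1.0  # Fisher information of the unit-variance location family
    return n_samples * b * b / fisher


def mean_scaled_fluctuation(n_samples: int, n_trials: int = 1000, seed: int = 2) -> float:
    """Average of N * B * F^{-1} * B over many independent data sets."""
    rng = random.Random(seed)
    alpha_bar = 0.3  # hypothetical target parameter
    return sum(scaled_fluctuation(n_samples, alpha_bar, rng) for _ in range(n_trials)) / n_trials


averages = {n: mean_scaled_fluctuation(n) for n in (50, 500)}
for n, value in averages.items():
    print(f"N = {n:4d}: <N B F^-1 B> = {value:.3f}")
```

Both averages should stay close to one even though $N$ changes by a factor of
ten, in line with the counting of $i = j$ terms above.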

2.4.5 Taming the fluctuations: the role of the prior 

One may notice that we never used the specific form of
$M(\epsilon, N, d_{\rm VC})$, which is the only thing dependent on the precise
value of the dimension. Actually, a more thorough look at the proof shows that
we do not even need the strict uniform convergence enforced by the
Glivenko-Cantelli bound. With some modifications the proof should still hold if
there exist some a priori improbable values of $\alpha$ and $\bar{\alpha}$ that
lead to violation of the bound. That is, if the prior $\mathcal{P}(\alpha)$ has
sufficiently narrow support, then we may still expect fluctuations to be
unimportant even for VC-infinite problems.

To see this, consider two examples. A variable $x$ is distributed according to
the following probability density functions:

$$
Q(x|a) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - a)^2 \right],
   \quad x \in (-\infty; +\infty);
\tag{2.77}
$$

$$
Q(x|a) = \frac{\exp(-\sin ax)}{\int_0^{2\pi} dx\, \exp(-\sin ax)},
   \quad x \in [0; 2\pi).
\tag{2.78}
$$

Learning the parameter in the first case is a $d_{\rm VC} = 1$ problem, while in
the second case $d_{\rm VC} = \infty$. In the first example, as we have shown
above, one may construct a uniform bound on fluctuations irrespective of the
prior $\mathcal{P}(\alpha)$. The second one does not allow this. Indeed, suppose
that the prior is uniform in a box $0 < a < a_{\max}$, and zero elsewhere, with
$a_{\max}$ rather large. Then for not too many sample points $N$, the data
would be better fitted not by some value in the vicinity of the actual
parameter, but by some much larger value, for which almost all data points are
at the crests of $-\sin ax$. Adding a new data point would not help, until that
best, but wrong, parameter estimate is less than $a_{\max}$.* So the
fluctuations are large, and the predictive information is small in this case.
Eventually, however, data points would overwhelm the box size, and the best
estimate of $a$ would swiftly approach the actual value. At this point the
argument of Clarke and Barron (1990) would become applicable, and the leading
behavior of the subextensive entropy would converge to its asymptotic value of
$(1/2) \log N$. On the other hand, there is no uniform bound on the value of $N$
for which this convergence will occur: it is guaranteed only for
$N \gg d_{\rm VC}$, which is never true if $d_{\rm VC} = \infty$. For some
sufficiently wide priors this asymptotically correct behavior would never be
reached in practice. Further, if we imagine a thermodynamic limit where the box
size and the number of samples both become large, then by analogy with problems
in supervised learning (Seung et al. 1992, Haussler et al. 1996) we expect that
there can be sudden changes in performance as a function of the number of
examples. The arguments of Clarke and Barron cannot encompass these phase
transitions or "aha!" phenomena.
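
The behavior described for the model of Eq. (2.78) can be explored with a small
numerical experiment. The sketch below is only an illustration: the true
parameter, the box size, the number of samples, and the grid resolutions are
all hypothetical choices of ours. It draws a few samples from
$Q(x|a) \propto \exp(-\sin ax)$, scans the likelihood over a grid of $a$ inside
the box $0 < a < a_{\max}$, and reports the best-fitting value, which for small
$N$ will often lie far from the true one.

```python
import math
import random

TWO_PI = 2.0 * math.pi
GRID = 1000  # midpoint-rule points for the normalization integral (our choice)


def log_norm(a: float) -> float:
    """log Z(a), where Z(a) is the integral over [0, 2*pi) of exp(-sin(a*x))."""
    dx = TWO_PI / GRID
    return math.log(dx * sum(math.exp(-math.sin(a * (k + 0.5) * dx)) for k in range(GRID)))


def log_likelihood(a: float, xs: list) -> float:
    """Log-likelihood of the samples xs under Q(x|a) = exp(-sin(a*x)) / Z(a)."""
    return -sum(math.sin(a * x) for x in xs) - len(xs) * log_norm(a)


def draw_samples(a_true: float, n: int, rng: random.Random) -> list:
    """Rejection sampling from Q(x|a_true); the acceptance ratio never exceeds 1."""
    xs = []
    while len(xs) < n:
        x = rng.uniform(0.0, TWO_PI)
        if rng.random() < math.exp(-math.sin(a_true * x) - 1.0):
            xs.append(x)
    return xs


A_TRUE, A_MAX, N_SAMPLES = 1.0, 40.0, 5  # hypothetical values
rng = random.Random(3)
data = draw_samples(A_TRUE, N_SAMPLES, rng)

# Candidate parameters on a grid inside the prior box, plus the true value itself.
candidates = [0.1 * k for k in range(1, int(A_MAX * 10))] + [A_TRUE]
best_a = max(candidates, key=lambda a: log_likelihood(a, data))

ll_best = log_likelihood(best_a, data)
ll_true = log_likelihood(A_TRUE, data)
print(f"best-fitting a = {best_a:.1f} (true a = {A_TRUE})")
print(f"log-likelihood gain over the true parameter: {ll_best - ll_true:.3f} nats")
```

Repeating the scan with larger `N_SAMPLES` at fixed `A_MAX` is one way to look
for the regime in which the best estimate settles near the true value.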

Following the intuition inferred from this example, we can now proceed with a
more formal analysis. As the above argument about the smallness of fluctuations
in the finite $d_{\rm VC}$ case paralleled the discussion of the Empirical Risk
Minimization (ERM) approach (Vapnik 1998), this present argument closely
resembles some statements of the Structural Risk Minimization (SRM) theory
(Vapnik 1998), which deals with the case of $d_{\rm VC} = \infty$ or,
equivalently, $N/d_{\rm VC} < 1$. While ERM solves the problem of uniform
non-Bayesian learning, there seems to be a general agreement that SRM theory is
a solution to the problem of learning with a prior. However, to our knowledge,
no explicit identification of why this is so has been done, so we try to do it
here.

* Interestingly, since for the model Eq. (2.78) the KL divergence is bounded
from below and above, for $a_{\max} \to \infty$ the weight in
$\rho(D; \bar{a})$ at small $D_{\rm KL}$ vanishes, and a finite weight
accumulates at some nonzero value of $D$. Thus, even putting the fluctuations
aside, the asymptotic behavior based on the phase space dimension is
invalidated, as mentioned above.

Suppose that, as in the above example, admissible solutions of a learning
problem belong to some subset $C_1$ of the whole $K$-dimensional parameter
space $\mathcal{C}$. Suppose also that for any finite $C_1$ the VC dimension of
the corresponding learning problem, $d_{\rm VC}(C_1)$, is finite, but
$d_{\rm VC}(\mathcal{C}) = \infty$. In SRM theory a nested set of such
subspaces $C_1 \subset C_2 \subset C_3 \subset \cdots$ is called a structure
$\mathcal{C}$ if $\mathcal{C} = \bigcup_n C_n$. Each $C_n$ is known as a
structure element. Since the subsets are nested,
$d_{\rm VC}(C_1) \le d_{\rm VC}(C_2) \le d_{\rm VC}(C_3) \le \cdots$. We know
that it is the large VC dimensions and, therefore, the parameters that belong
to the large structure elements $C_n$, $n \to \infty$, that are responsible for
large fluctuations. But in view of Eq. (2.53), for any properly defined prior
$\mathcal{P}(\alpha)$, very large values of $\alpha$ are a priori improbable.
Thus the fight between the prior and the data may result in an effective cutoff
$n^*$, so that all $C_n$, $n > n^*$, contribute little to $S_1^{(f)}$, and the
fluctuations are controlled.

Indeed, let's form a structure by assigning all $\alpha$'s for which
$-\log \mathcal{P}(\alpha) + \max \log \mathcal{P} < n$ to the element $C_n$
($n$ is not necessarily integer). This imposes an a priori probability $\nu(n)$
on the elements themselves. Now we can bound the internal integral in Eq. (2.71)
by replacing $\psi(\alpha, \bar{\alpha}, \{x_i\})$ with
$\psi_n(\bar{\alpha}, \{x_i\})$, its maximal value on the smallest element
$C_n$ that includes $\alpha$. If the logarithm of the a priori probability
$\nu(n)$ falls off faster than $N\psi_n(\bar{\alpha}, \{x_i\})$ increases as
$n$ grows, then one can select a particular $n^*$, for which the integral over
all $C_n$, $n > n^*$, is smaller than any predefined $\delta$. Effectively
$n^*$ then serves as a cutoff. Note that, since fluctuations enter multiplied by
$N$, $n^*(N)$ is a nondecreasing function. If it grows in a way such that
$d_{\rm VC}(C_{n^*})$ is sublinear in $N$ ($\sim N/\log N$ suffices), then
$M(\epsilon, N, d_{\rm VC})$ is still subexponential, and we can use the proofs
of the preceding section to show that the fluctuations are controllable. The
only difference that occurs is that the contribution of typical fluctuations is
dominated by a saddle point near $\alpha_{\rm cl}$, which solves the equation

$$
\frac{\partial}{\partial \alpha}
   \left[ -\log \mathcal{P}(\alpha) + N D_{\rm KL}(\bar{\alpha}\|\alpha) \right]
   = 0.
\tag{2.79}
$$

If $\bar{\alpha}$ is only in very large structure elements that contribute
little to the internal integral of Eq. (2.71), then $\alpha_{\rm cl}$ may be
quite far from $\bar{\alpha}$. That is, the best estimate of $\bar{\alpha}$ may
be imprecise at any finite $N$. This is particularly important in the case
of nonparametric learning (see Sections [2.4.7| , p75|) . 



In finite dimensional cases similar to the above example, every $C_n$,
$n < \infty$, has finite VC dimension $d_{\rm VC}$, and this dimension is
bounded from above by the phase space dimension $d$. The magnitude of
fluctuations depends mostly on $d_{\rm VC}$. Therefore, beyond some $n^*(N)$
for which $d_{\rm VC}(C_{n^*}) = d$, the fluctuations will practically stop
growing. This means that any proper prior $\mathcal{P}$, however slowly
decreasing at infinities, is enough to impose a finite cutoff and render
fluctuations finite. This is in complete agreement with Clarke and Barron, but
the result is prior-dependent.

We want to emphasize again that, in general, fluctuations are controlled only
if two related, but not equivalent, assumptions are true. First, for any finite
$N$ there has to be a finite cutoff $n^*(N)$. This means that
$\mathcal{P}(\alpha)$ is narrow enough to define a valid structure. Second, for
the fluctuations within $C_{n^*}$ to be small, $d_{\rm VC}(C_{n^*(N)})$ must
grow sublinearly in $N$.* In this case the number of samples eventually
outgrows the current VC dimension by an arbitrarily large factor, and
determination of parameters is possible to any precision. Both of these
conditions are well known in SRM theory (Vapnik 1998).

* Actually, the $n^*$-dependence of the factors similar to $L[F]$, defined
above, may require a different, yet slower, growth [see Vapnik (1998) for
details]. But this is outside the scope of this discussion.

In the classical SRM theory, only selection of the law $n^* = n^*(N)$ is a part
of the problem, and the structure is usually assumed to be given. Ideally, this
law is selected by minimizing the expected error of learning, which consists of
uncertainties due to the limited set of allowed solutions ($n^* < \infty$) and
due to the fluctuations within this set. These uncertainties behave oppositely
as $n^*$ increases. If calculating the expected error is difficult, one may be
content with simply preselecting the law $n^* = n^*(N)$, and then every law for
which the VC dimension grows sublinearly does the job (better or worse), just
as we have shown above. In our current treatment the structure and the law of
the VC dimension growth are both a result of the prior. If the prior is
appropriate, then so are the structure and the law. If not, then learning with
this prior is impossible. On general grounds, we know that when the prior
correctly embodies the a priori knowledge, it results in the fastest average
learning possible. Therefore we are guaranteed that, on average, the law
$n^* = n^*(N)$ is optimal if this law is imposed by the prior (see



Sections p7^ p3| for more on this). 

Summarizing, we note that while much of learning theory has properly focused on
problems with finite VC dimension, it might be that the conventional scenario
in which the number of examples eventually overwhelms the number of parameters
or dimensions is too weak to deal with many real world problems. Certainly in
the present context there is not only a quantitative, but also a qualitative
difference between reaching the asymptotic regime in just a few measurements,
or in many millions of them. Finitely parameterizable models with finite or
infinite $d_{\rm VC}$ fall in essentially different universality classes with
respect to the predictive information.

2.4.6 Beyond finite parameterization: general considerations 

The previous sections have considered learning from time series where the
underlying class of possible models is described with a finite number of
parameters. If the number of parameters is not finite then in principle it is
impossible to learn anything unless there is some appropriate regularization of
the problem. If we let the number of parameters stay finite but become large,
then there is more to be learned and correspondingly the predictive information
grows in proportion to this number, as in Eq. (2.62). On the other hand, if the
number of parameters becomes infinite without regularization, then the
predictive information should go to zero since nothing can be learned. We
should be able to see this happen in a regularized problem as the
regularization weakens: eventually the regularization would be insufficient and
the predictive information would vanish. The only way this can happen is if the
subextensive term in the entropy grows more and more rapidly with $N$ as we
weaken the regularization, until finally it becomes extensive at the point
where learning becomes impossible. More precisely, if this scenario for the
breakdown of learning is to work, there must be situations in which the
predictive information grows with $N$ more rapidly than the logarithmic
behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a
function of their Kullback-Leibler divergence to the target model. If the
models have finite VC and phase space dimensions then this density vanishes for
small divergences as $\rho \sim D^{(d-2)/2}$. Phenomenologically, if we let the
number of parameters increase, the density vanishes more and more rapidly. We
can imagine that beyond the class of finitely parameterizable problems there is
a class of regularized infinite dimensional problems in which the density
$\rho(D \to 0)$ vanishes more rapidly than any power of $D$. As an example, we
could have

$$
\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^{\mu}} \right], \quad \mu > 0;
\tag{2.80}
$$

that is, an essential singularity at $D = 0$. For simplicity we assume that the
constants $A$ and $B$ can depend on the target model, but that the nature of
the essential singularity ($\mu$) is the same everywhere. Before providing an
explicit example, let us explore the consequences of this behavior.

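From Eq. (2.54) above, we can write the partition function as

$$
Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-ND]
   \approx A(\bar{\alpha}) \int dD\, \exp\left[ -\frac{B(\bar{\alpha})}{D^{\mu}} - ND \right]
   \approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{1}{2}\, \frac{\mu+2}{\mu+1}\, \ln N
   - C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right],
\tag{2.81}
$$

where in the last step we use a saddle point or steepest descent approximation
which is accurate at large $N$, and the coefficients are

$$
\tilde{A}(\bar{\alpha}) = A(\bar{\alpha})
   \left( \frac{2\pi \mu^{1/(\mu+1)}}{\mu+1} \right)^{1/2}
   \cdot [B(\bar{\alpha})]^{1/(2\mu+2)},
\tag{2.82}
$$

$$
C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)}
   \left( \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right).
\tag{2.83}
$$

Finally we can use Eqs. (2.61, 2.81) to compute the subextensive term in the
entropy, keeping only the dominant term at large $N$,

$$
S_1^{(a)}(N) \to \frac{1}{\ln 2}\,
   \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \ \text{(bits)},
\tag{2.84}
$$

where $\langle \cdots \rangle_{\bar{\alpha}}$ denotes an average over all the
target models.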

This behavior of the first subextensive term is qualitatively different from ev- 
erything we have observed so far. A power law divergence is much stronger than 
a logarithmic one. Therefore, a lot more predictive information is accumulated 
in an "infinite parameter" (or nonparametric) system; the system is much richer 
and more complex, both intuitively and quantitatively. 
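
The saddle point evaluation behind this power law can be checked numerically.
The sketch below is our own illustration, with hypothetical constants $A = 1$,
$B = 1$ and $\mu = 1$: it integrates $\exp(-B/D^{\mu} - ND)$ over $D$ by the
midpoint rule and compares the logarithm of the result with the steepest
descent estimate, in which the saddle sits at
$D^* = (\mu B/N)^{1/(\mu+1)}$ and the exponent at the saddle is
$C N^{\mu/(\mu+1)}$ with
$C = B^{1/(\mu+1)} [\mu^{-\mu/(\mu+1)} + \mu^{1/(\mu+1)}]$.

```python
import math


def log_z_numeric(n: float, mu: float, b: float, steps: int = 200_000) -> float:
    """log of the integral over D > 0 of exp(-b / D**mu - n * D), by the midpoint rule."""
    d_star = (mu * b / n) ** (1.0 / (mu + 1.0))  # location of the saddle point
    f_star = b / d_star ** mu + n * d_star        # exponent at the saddle point
    h = 20.0 * d_star / steps                     # the integrand is negligible beyond 20 D*
    total = 0.0
    for k in range(steps):
        d = (k + 0.5) * h
        total += math.exp(-(b / d ** mu + n * d - f_star))  # rescaled to avoid underflow
    return math.log(total * h) - f_star


def log_z_saddle(n: float, mu: float, b: float) -> float:
    """Steepest descent estimate of the same integral (prefactor A set to 1)."""
    a_tilde = math.sqrt(2.0 * math.pi * mu ** (1.0 / (mu + 1.0)) / (mu + 1.0))
    a_tilde *= b ** (1.0 / (2.0 * mu + 2.0))
    c = b ** (1.0 / (mu + 1.0)) * (mu ** (-mu / (mu + 1.0)) + mu ** (1.0 / (mu + 1.0)))
    return (math.log(a_tilde)
            - 0.5 * (mu + 2.0) / (mu + 1.0) * math.log(n)
            - c * n ** (mu / (mu + 1.0)))


MU, B = 1.0, 1.0  # hypothetical singularity strength and coefficient
comparison = {n: (log_z_numeric(n, MU, B), log_z_saddle(n, MU, B)) for n in (100, 1000, 10000)}
for n, (numeric, saddle) in comparison.items():
    print(f"N = {n:6d}: ln Z numeric = {numeric:.4f}, saddle point = {saddle:.4f}")
```

The two columns should approach each other as $N$ grows, and the dominant
$-C N^{\mu/(\mu+1)}$ term is what produces the power law growth of the
subextensive entropy.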

Subextensive entropy also grows as a power law in a finitely parameterizable
system with a growing number of parameters [compare to the spin chain with
decaying interactions in Fig. (2.3)]. For example, suppose that we approximate
the distribution of a random variable by a histogram with $K$ bins, and we let
$K$ grow with the quantity of available samples as $K \sim N^{\nu}$. Equation
(2.62) suggests that in a $K$-parameter system, the $N$-th sample point
contributes $\sim K/2N$ bits to the subextensive entropy. If $K$ changes as
mentioned, the $N$-th example then carries $\sim N^{\nu-1}$ bits. Summing this
up over all samples, we find $S_1 \sim N^{\nu}$, and if we let
$\nu = \mu/(\mu+1)$ we obtain Eq. (2.84). Note that the growth of the number of
parameters is slower than $N$ ($\nu = \mu/(\mu+1) < 1$), which makes sense both
intuitively and within the framework of the above SRM fluctuation analysis.
Indeed, this growing number of parameters is nothing but expanding structure
elements, and $d_{\rm VC}$ increasing with them,
$d_{\rm VC}(C_{n^*(N)}) = d_{\rm VC}(N)$. Therefore, sublinear growth is needed
for the fluctuation control.
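
The counting argument for the histogram can be reproduced in a few lines. The
sketch below simply assumes, as in the text, that the $n$-th sample contributes
$K(n)/2n$ bits with $K(n) = n^{\nu}$ (the proportionality constant is set to
one, which is our choice), sums these contributions, and estimates the growth
exponent of the total from two values of $N$.

```python
import math


def subextensive_entropy(n_samples: int, nu: float) -> float:
    """Sum of per-sample contributions K(n) / (2 n) bits with K(n) = n**nu."""
    return sum(0.5 * n ** (nu - 1.0) for n in range(1, n_samples + 1))


def growth_exponent(nu: float, n_small: int = 10_000, n_large: int = 100_000) -> float:
    """Log-log slope of the summed entropy between two sample sizes."""
    s_small = subextensive_entropy(n_small, nu)
    s_large = subextensive_entropy(n_large, nu)
    return math.log(s_large / s_small) / math.log(n_large / n_small)


exponents = {nu: growth_exponent(nu) for nu in (0.25, 0.5, 0.75)}
for nu, slope in exponents.items():
    mu = nu / (1.0 - nu)  # the singularity strength that gives nu = mu / (mu + 1)
    print(f"nu = {nu:.2f} (mu = {mu:.2f}): fitted exponent = {slope:.3f}")
```

The fitted exponents should approach $\nu$ as the sample sizes grow; the small
residual differences come from the constant term in the sum.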

Power law growth of the predictive information illustrates the point made
earlier about the transition from learning more to finally learning nothing as
the class of investigated models becomes more complex. As $\mu$ increases, the
problem becomes richer and more complex, and this is expressed in the stronger
divergence of the first subextensive term of the entropy; for fixed large $N$,
the predictive information increases with $\mu$. However, if
$\mu \to \infty$ the problem is too complex for learning: in our model example
the number of bins grows in proportion to the number of samples, which means
that we are trying to find too much detail in the underlying distribution. As a
result, the subextensive term becomes extensive and stops contributing to
predictive information. Thus, at least to the leading order, predictability is
lost, as promised.

2.4.7 Beyond finite parameterization: example 

The discussion in the previous section suggests that we should look for
power-law behavior of the predictive information in learning problems where,
rather than learning ever more precise values for a fixed set of parameters, we
learn a progressively more detailed description (effectively increasing the
number of parameters) as we collect more data. One example of such a problem is
learning the distribution $Q(x)$ for a continuous variable $x$, but rather than
writing a parametric form of $Q(x)$ we assume only that this function itself is
chosen from some distribution that enforces a degree of smoothness. There are
some natural connections of this problem to the methods of quantum field theory
(Bialek, Callan, and Strong 1996) which we can exploit to give a complete
calculation of the predictive information, at least for a class of smoothness
constraints.

We write $Q(x) = (1/l_0) \exp[-\phi(x)]$ so that positivity of the distribution
is automatic, and then smoothness may be expressed by saying that the 'energy'
(or action) associated with a function $\phi(x)$ is related to an integral over
its derivatives, like the strain energy in a stretched string. The simplest
possibility following this line of ideas is that the distribution of functions
is given by

$$
\mathcal{P}[\phi(x)] = \frac{1}{\mathcal{Z}}
   \exp\left[ -\frac{l}{2} \int dx \left( \frac{\partial \phi}{\partial x} \right)^2 \right]
   \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right],
\tag{2.85}
$$

where $\mathcal{Z}$ is the normalization constant for $\mathcal{P}[\phi]$, the
delta function ensures that each distribution $Q(x)$ is normalized, and $l$
sets a scale for smoothness. If distributions are chosen from this
distribution, then the optimal Bayesian estimate of $Q(x)$ from a set of
samples $x_1, x_2, \cdots, x_N$ converges to the correct answer, and the
distribution at finite $N$ is nonsingular, so that the regularization provided
by this prior is strong enough to prevent the development of singular peaks at
the location of observed data points (Bialek, Callan, and Strong 1996).*
Further developments of the theory, including alternative choices of
$\mathcal{P}[\phi(x)]$, have been given by Periwal (1997, 1998), Holy (1997),
and Aida (1998). We chose the original formulation for our analysis because our
goal here is to be illustrative rather than exhaustive.

From the discussion above we know that the predictive information is related 
to the density of Kullback-Leibler divergences, and that the power-law behavior 
we are looking for comes from an essential singularity in this density function. 
To illustrate this point, we calculate the predictive information using the density, 
even though an easier direct way exists. 

With $Q(x) = (1/l_0) \exp[-\phi(x)]$, we can write the KL divergence as

$$
D_{\rm KL}[\bar{\phi}(x)\|\phi(x)] = \frac{1}{l_0} \int dx\,
   \exp[-\bar{\phi}(x)]\, [\phi(x) - \bar{\phi}(x)].
\tag{2.86}
$$

* We caution the reader that our discussion in this section is less
self-contained than in other sections. Since the crucial steps exactly parallel
those in the earlier work, here we just give references. To compensate for
this, we compiled a summary of the original results by Bialek et al. in the
Appendix A.1.

We want to compute the density,

$$
\rho(D; \bar{\phi}) = \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\,
   \delta\left( D - D_{\rm KL}[\bar{\phi}(x)\|\phi(x)] \right)
\tag{2.87}
$$

$$
= M \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\,
   \delta\left( MD - M D_{\rm KL}[\bar{\phi}(x)\|\phi(x)] \right),
\tag{2.88}
$$

where we introduce a factor $M$ which we will allow to become large so that we
can focus our attention on the interesting limit $D \to 0$. To compute this
integral over all functions $\phi(x)$, we introduce a Fourier representation
for the delta function, and then rearrange the terms:

$$
\rho(D; \bar{\phi}) = M \int \frac{dz}{2\pi} \exp(izMD)
   \int [d\phi(x)]\, \mathcal{P}[\phi(x)] \exp(-izM D_{\rm KL})
\tag{2.89}
$$

$$
= M \int \frac{dz}{2\pi}
   \exp\left( izMD + \frac{izM}{l_0} \int dx\, \bar{\phi}(x) \exp[-\bar{\phi}(x)] \right)
   \times \int [d\phi(x)]\, \mathcal{P}[\phi(x)]
   \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar{\phi}(x)] \right).
\tag{2.90}
$$

The inner integral over the functions $\phi(x)$ is exactly the integral which
was evaluated in the original discussion of this problem (Bialek, Callan and
Strong 1996); in the limit that $zM$ is large we can use a saddle point
approximation, and standard field theoretic methods allow us to compute the
fluctuations around the saddle
point. The result is that [cf. Eqs. (^-(^)] 

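$$
\int [d\phi(x)]\, \mathcal{P}[\phi(x)]
   \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar{\phi}(x)] \right)
   = \exp\left( -\frac{izM}{l_0} \int dx\, \phi_{\rm cl}(x) \exp[-\bar{\phi}(x)]
   - S_{\rm eff}[\phi_{\rm cl}(x); zM] \right),
\tag{2.91}
$$

$$
S_{\rm eff}[\phi_{\rm cl}; zM] = \frac{l}{2} \int dx
   \left( \frac{\partial \phi_{\rm cl}}{\partial x} \right)^2
   + \frac{1}{2} \left( \frac{izM}{l\, l_0} \right)^{1/2}
   \int dx\, \exp[-\phi_{\rm cl}(x)/2],
\tag{2.92}
$$

$$
l\, \partial_x^2 \phi_{\rm cl}(x) + \frac{izM}{l_0} \exp[-\phi_{\rm cl}(x)]
   = \frac{izM}{l_0} \exp[-\bar{\phi}(x)].
\tag{2.93}
$$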

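Now we can do the integral over $z$, again by a saddle point method. The two
saddle point approximations are both valid in the limit that $D \to 0$ and
$MD^{3/2} \to \infty$; we are interested precisely in the first limit, and we
are free to set $M$ as we wish, so this gives us a good approximation for
$\rho(D \to 0; \bar{\phi})$. Also, since $M$ is arbitrarily large,
$\phi_{\rm cl}(x) = \bar{\phi}(x)$. This results in

$$
\rho(D \to 0; \bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2}
   \exp\left( -\frac{B[\bar{\phi}(x)]}{D} \right),
\tag{2.94}
$$

$$
A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}}
   \exp\left[ -\frac{l}{2} \int dx\, (\partial_x \bar{\phi})^2 \right]
   \times \int dx\, \exp[-\bar{\phi}(x)/2],
\tag{2.95}
$$

$$
B[\bar{\phi}(x)] = \frac{1}{16\, l\, l_0}
   \left( \int dx\, \exp[-\bar{\phi}(x)/2] \right)^2.
\tag{2.96}
$$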


Except for the factor of $D^{-3/2}$, this is exactly the sort of essential
singularity that we considered in the previous section, with $\mu = 1$. The
$D^{-3/2}$ prefactor does not change the leading large $N$ behavior of the
predictive information, and we find that

$$
S_1^{(a)}(N) \sim \frac{1}{2 \ln 2\, \sqrt{l\, l_0}}
   \left\langle \int dx\, \exp[-\bar{\phi}(x)/2] \right\rangle_{\bar{\phi}} N^{1/2},
\tag{2.97}
$$

where $\langle \cdots \rangle_{\bar{\phi}}$ denotes an average over the target
distributions $\bar{\phi}(x)$ weighted once again by
$\mathcal{P}[\bar{\phi}(x)]$. Notice that if $x$ is unbounded then the average
in Eq. (2.97) is infrared divergent; if instead we let the variable $x$ range
from $0$ to $L$ then this average should be dominated by the uniform
distribution. Replacing the average by its value at this point, we obtain the
approximate result

$$
S_1^{(a)}(N) \approx \frac{1}{2 \ln 2}\, \sqrt{N} \left( \frac{L}{l} \right)^{1/2}.
\tag{2.98}
$$



To understand the result in Eq. (2.98), we recall that this field theoretic
approach is more or less equivalent to an adaptive binning procedure in which
we divide the range of $x$ into bins of local size $\sqrt{l/NQ(x)}$ (Bialek,
Callan, and Strong 1996; see also Appendix A.1). From this point of view,
Eq. (2.98) makes perfect sense: the predictive information is directly
proportional to the number of bins that can be put in the range of $x$. This
also is in direct accord with a comment from the previous subsection that power
law behavior of predictive information arises from the number of parameters in
the problem depending on the number of samples. More importantly, since
learning a distribution consisting of $\sim \sqrt{NL/l}$ bins is, certainly, a
$d_{\rm VC} \sim \sqrt{NL/l}$ problem, we can refer back to our discussion of
fluctuations in prior controlled learning scenarios (Section 2.4.5) to infer
that fluctuations pose no threat to this nonparametric learning setup.
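
The bin counting picture is easy to reproduce numerically. In the sketch below,
which is only an illustration with hypothetical values of $N$, $L$ and $l$, the
number of adaptive bins is obtained by integrating the inverse of the local bin
size $\sqrt{l/NQ(x)}$ over the range of $x$. For the uniform density on
$[0, L]$ this count equals $\sqrt{NL/l}$, and dividing it by $2 \ln 2$ gives
the estimate of Eq. (2.98). A triangular density is included as a second,
made-up example to show that a non-uniform target accommodates fewer bins.

```python
import math


def n_adaptive_bins(density, x_min: float, x_max: float,
                    n_samples: int, ell: float, grid: int = 100_000) -> float:
    """Number of bins of local width sqrt(ell / (N Q(x))) that fit into [x_min, x_max]."""
    dx = (x_max - x_min) / grid
    total = 0.0
    for k in range(grid):
        x = x_min + (k + 0.5) * dx
        total += math.sqrt(n_samples * density(x) / ell) * dx  # dx divided by the bin width
    return total


def predictive_info_bits(n_bins: float) -> float:
    """Estimate of the subextensive entropy as n_bins / (2 ln 2), following Eq. (2.98)."""
    return n_bins / (2.0 * math.log(2.0))


N, L, ELL = 1000, 10.0, 0.1  # hypothetical sample size, box size, smoothness scale

bins_uniform = n_adaptive_bins(lambda x: 1.0 / L, 0.0, L, N, ELL)
bins_triangular = n_adaptive_bins(lambda x: 2.0 * x / L ** 2, 0.0, L, N, ELL)

print(f"uniform density:    {bins_uniform:.2f} bins, sqrt(N L / l) = {math.sqrt(N * L / ELL):.2f}")
print(f"triangular density: {bins_triangular:.2f} bins")
print(f"estimate from Eq. (2.98): {predictive_info_bits(bins_uniform):.2f} bits")
```

Because the count scales as $\sqrt{N}$, quadrupling the number of samples only
doubles the number of bins, which is the sublinear growth required for the
control of fluctuations.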

One thing which remains troubling is that the predictive information depends
on the arbitrary parameter $l$ describing the scale of smoothness in the
distribution. In the original work it was proposed that one should integrate
over possible values of $l$ (Bialek, Callan and Strong 1996). Numerical
simulations demonstrate
that this parameter can be learned from the data itself (see Chapter ^, but per- 
haps even more interesting is a formulation of the problem by Periwal (1997, 
1998) which recovers complete coordinate invariance by effectively allowing $l$
to vary with $x$. In this case the whole theory has no length scale, and there
also is no need to confine the variable $x$ to a box (here of size $L$). We
expect that this coordinate invariant approach will lead to a universal
coefficient multiplying $\sqrt{N}$ in the analog of Eq. (2.98), but this
remains to be shown.

In summary, the field theoretic approach to learning a smooth distribution in
one dimension provides us with a concrete, calculable example of a learning
problem with power-law growth of the predictive information. The scenario is
exactly as suggested in the previous section, where the density of KL
divergences develops an essential singularity. Heuristic considerations
(Bialek, Callan, and Strong 1996; Aida 1999) suggest that different smoothness
penalties [for example, replacing the kinetic term in the prior, Eq. (2.85), by
$(\partial_x^{\eta} \phi)^2$] and generalizations to higher dimensional
problems ($\dim x = \zeta$) will have sensible effects on the predictive
information

$$
S_1(N) \sim N^{\zeta/2\eta}.
\tag{2.99}
$$

This shows a power-law growth. Smoother functions have smaller powers (less to
learn) and higher dimensional problems have larger powers (more to learn), but
real calculations for these cases remain challenging.

2.5 $I_{\rm pred}$ as a measure of complexity

The problem of quantifying complexity is very old. Solomonoff (1964),
Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous
notion of complexity that measures (roughly) the minimum length of a computer
program that simulates the observed time series [see also Li and Vitanyi
(1993)]. Unfortunately there is no algorithm that can calculate the Kolmogorov
complexity of any data set. In addition, algorithmic or Kolmogorov complexity
is closely related to the Shannon entropy, which means that it measures
something closer to our intuitive concept of randomness than to the intuitive
concept of complexity [as discussed, for example, by Bennett (1990)]. These
problems have fueled continued research along two different paths, representing
two major motivations for defining complexity. First, one would like to make
precise an impression that some systems (such as life on earth or a turbulent
fluid flow) evolve toward a state of higher complexity, and one would like to
be able to classify those states. Second, in choosing among different models
that describe an experiment, one wants to quantify a preference for simpler
explanations or, equivalently, provide a penalty for complex models that can be
weighed against the more conventional goodness of fit criteria. We bring our
readers up to date with some developments in both of these directions, and then
discuss the role of predictive information as a measure of complexity. This
also gives us an opportunity to discuss more carefully the relation of our
results to previous work.

2.5.1 Complexity of statistical models 

The construction of complexity penalties for model selection is a statistics prob- 
lem. As far as we know, however, the first discussions of complexity in this con- 
text belong to philosophical literature. Even leaving aside the early contributions 
of William of Occam on the need for simplicity, Hume on the problem of in- 
duction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it 
would be possible to create a model selection criterion that balances goodness of 
fit versus complexity. Wallace and Burton (1968) hinted that this balance may re- 
sult in the model with "the briefest recording of all attribute information." Even 
though he probably had a somewhat different motivation, Akaike (1974a, 1974b) 
made the first quantitative step along these lines. His ad hoc complexity term was 
independent of the number of data points and was proportional to the number 
of free independent parameters in the model. 

These ideas were rediscovered and developed systematically by Rissanen in a 
series of papers starting from 1978. He has emphasized strongly (Rissanen 1984, 
1986, 1987) that fitting a model to data represents an encoding of those data, or 
predicting future data, and that in searching for an efficient code we need to mea-
sure not only the number of bits required to describe the deviations of the data
from the model's predictions (goodness of fit), but also the number of bits re-
quired to specify the parameters of the model (complexity). This specification
has to be done to a precision supported by the data.^10 Rissanen (1984) and Clarke
and Barron (1990) in full generality were able to prove that the optimal encod- 
ing of a model requires a code with length asymptotically proportional to the 
number of independent parameters and logarithmically dependent on the num- 
ber of data points we have observed. The minimal amount of space one needs 
to encode a data string (minimum description length or MDL) within a certain 
assumed model class was termed by Rissanen stochastic complexity, and in recent 
work he refers to the piece of the stochastic complexity required for coding the 
parameters as the model complexity (Rissanen 1996). This approach was further 
strengthened by a recent result (Vitanyi and Li 2000) that an estimation of param- 
eters using the MDL principle is equivalent to Bayesian parameter estimations 
with a "universal" prior (Li and Vitanyi 1993). 
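As a rough illustration of the two-part code described above, the following Python sketch compares the description length of a sequence of coin flips under a fair-coin model (no free parameters) and a biased-coin model (one free parameter). It keeps only the leading-order penalty of (K/2) log2 N bits for K parameters; the function names are hypothetical, and this is a simplified stand-in, not Rissanen's exact stochastic complexity.

```python
import math


def binary_entropy_bits(p):
    """Entropy in bits of a coin with heads probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def description_length_bits(heads, n, fit_bias):
    """Leading-order two-part code length for n coin flips.

    fit_bias=False: fair coin, K = 0 parameters, one bit per flip.
    fit_bias=True:  bias fitted by maximum likelihood, K = 1 parameter,
                    charged (K / 2) * log2(n) bits for coding the parameter.
    """
    if not fit_bias:
        return float(n)
    goodness_of_fit = n * binary_entropy_bits(heads / n)
    model_complexity = 0.5 * math.log2(n)  # assumed asymptotic form
    return goodness_of_fit + model_complexity


def preferred_model(heads, n):
    """Return the name of the model with the shorter total description."""
    fair = description_length_bits(heads, n, fit_bias=False)
    biased = description_length_bits(heads, n, fit_bias=True)
    return "fair" if fair <= biased else "biased"
```

With 100 flips, an even split favors the fair coin, since the extra parameter costs about 3.3 bits and buys no improvement in fit, whereas 90 heads favors the biased coin.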

^10 Within this framework Akaike's suggestion can be seen as coding the model to (suboptimal)
fixed precision.

There should be a close connection between Rissanen's ideas of encoding the
data stream and the subextensive entropy. We are accustomed to the idea that the
average length of a code word for symbols drawn from a distribution P is given
by the entropy of that distribution; thus it is tempting to say that an encoding
of a stream $x_1, x_2, \ldots, x_N$ will require an amount of space equal to the entropy
of the joint distribution $P(x_1, x_2, \ldots, x_N)$. The situation here is a bit more subtle,
because the usual proofs of equivalence between code length and entropy rely
on notions of typicality and asymptotics as we try to encode sequences of many
symbols; here we already have N symbols and so it doesn't really make sense to
talk about a stream of streams. One can argue, however, that atypical sequences
are not truly random within a considered distribution since their coding by the
methods optimized for the distribution is not optimal. So atypical sequences are
better considered as typical ones coming from a different distribution [a point
also made by Grassberger (1986)]. This allows us to identify properties of an
observed (long) string with the properties of the distribution it comes from, as
was done by Vitanyi and Li (2000). If we accept this identification of entropy
with code length, then Rissanen's stochastic complexity should be the entropy of
the distribution $P(x_1, x_2, \ldots, x_N)$.

As emphasized by Balasubramanian (1996), the entropy of the joint distribu-
tion of N points can be decomposed into pieces that represent noise or errors in
the model's local predictions — an extensive entropy — and the space required to 
encode the model itself, which must be the subextensive entropy. Since in the 
usual formulation all long-term predictions are associated with the continued 
validity of the model parameters, the dominant component of the subextensive 
entropy must be this parameter coding, or model complexity in Rissanen's ter- 
minology. Thus the subextensive entropy should be the model complexity, and 
in simple cases where we can describe the data by a K-parameter model both
quantities are equal to $(K/2) \log_2 N$ bits to the leading order.
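The K = 1 case can be checked numerically. The sketch below is our own illustrative example, assuming a coin whose bias is drawn from a uniform prior; it computes the exact entropy of N flips and subtracts the extensive part, which should leave a subextensive remainder growing as roughly (1/2) log2 N bits. The function and constant names are hypothetical.

```python
import math

# Prior-averaged entropy per flip, in bits: the integral of the binary
# entropy over a uniform prior on the bias equals 1 / (2 ln 2).
ENTROPY_PER_FLIP_BITS = 1.0 / (2.0 * math.log(2.0))


def joint_entropy_bits(n):
    """Entropy in bits of n flips of a coin whose bias is uniform on [0, 1].

    Each of the C(n, k) sequences with k heads has probability
    1 / ((n + 1) * C(n, k)), so S(n) = log2(n + 1) + mean_k log2 C(n, k).
    """
    log2_binomials = [
        (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1))
        / math.log(2.0)
        for k in range(n + 1)
    ]
    return math.log2(n + 1) + sum(log2_binomials) / (n + 1)


def coin_subextensive_bits(n):
    """Joint entropy minus the extensive term n * S_0."""
    return joint_entropy_bits(n) - n * ENTROPY_PER_FLIP_BITS


# Quadrupling the sample should add about (1 / 2) * log2(4) = 1 bit.
growth_bits = coin_subextensive_bits(4000) - coin_subextensive_bits(1000)
```

The constant offset of the subextensive term depends on the prior, which is why the comparison uses a difference between two sample sizes.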

The fact that the subextensive entropy or predictive information agrees with 
Rissanen's model complexity suggests that $I_{\mathrm{pred}}$ provides a reasonable measure
of complexity in learning problems. On the other hand, this agreement might
lead the reader to wonder if all we have done is to rewrite the results of Ris-
sanen et al. in a different notation. To calm these fears we recall again that our
approach distinguishes infinite VC problems from finite ones and treats nonpara- 
metric cases as well. Indeed, the predictive information is defined without ref- 
erence to the idea that we are learning a model, and thus we can make a link to 
physical aspects of the problem. 

2.5.2 Complexity of dynamical systems 

There is a strong prejudice that the complexity of physical systems should be 
measured by quantities that are at least related to more conventional thermody- 
namic quantities (temperature, entropy, . . .), since this is the only way one will be 
able to calculate complexity within the framework of statistical mechanics. Most 
proposals define complexity as an entropy-like quantity, but an entropy of some 
unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity 
as thermodynamic depth, the entropy of the state sequences that lead to the cur- 
rent state. The idea is clearly in the same spirit as the measurement of predictive 
information, but this depth measure does not completely discard the extensive 
component of the entropy (Crutchfield and Shalizi 1999) and thus fails to resolve 
the essential difficulty in constructing complexity measures for physical systems:
distinguishing genuine complexity from randomness (entropy). The complexity
should be zero both for purely regular and for purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz 
et al. 1995, Gell-Mann and Lloyd 1996, Shiner et al. 1999, Sole and Luque 1999, 
Adami and Cerf 2000) and criticisms of these proposals (Crutchfield et al. 1999, 
Feldman and Crutchfield 1998, Sole and Luque 1999) continue to emerge even
now. Aside from the obvious problems of not actually eliminating the extensive 
component for all or a part of the parameter space or not expressing complexity 
as an average over a physical ensemble, the critiques often are based on a clever 
argument first mentioned explicitly by Feldman and Crutchfield (1998). In an at- 
tempt to create a universal measure, the constructions can be made over-universal: 
many proposed complexity measures depend only on the entropy density $S_0$ and
thus are functions only of disorder — not a desired feature. In addition, many of 
these and other definitions are flawed because they fail to distinguish among the 
richness of classes beyond some very simple ones. 

In a series of papers, Crutchfield and coworkers identified statistical complexity 
with the entropy of causal states, which in turn are defined as all those microstates 
(or histories) that have the same conditional distribution of futures (Crutchfield 
and Young 1989, Shalizi and Crutchfield 1999). The causal states provide an op- 
timal description of a system's dynamics in the sense that these states make as 
good a prediction as the histories themselves. Statistical complexity is very simi- 
lar to predictive information, but Shalizi and Crutchfield (1999) define a quantity 
which is even closer to the spirit of our discussion: their excess entropy is exactly 
the mutual information between the semi-infinite past and future. Unfortunately, 
by focusing on cases in which the past and future are infinite but the excess en- 
tropy is finite, their discussion is limited to systems for which (in our language) 
$I_{\mathrm{pred}}(T \to \infty) = \mathrm{constant}$.
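A minimal example of a system with finite excess entropy is sketched below, assuming a stationary two-state Markov chain that flips its state with a fixed probability (our choice of illustration, not an example taken from the works cited). The block entropy is computed by brute-force enumeration, and the subextensive part stays at the constant 1 - H2(flip) bits for every block length. Function names are hypothetical.

```python
import itertools
import math


def binary_entropy_bits(p):
    """Entropy in bits of a binary variable with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def block_entropy_bits(n, flip):
    """Entropy in bits of n successive symbols, by brute-force enumeration."""
    total = 0.0
    for sequence in itertools.product((0, 1), repeat=n):
        probability = 0.5  # stationary distribution is uniform by symmetry
        for previous, current in zip(sequence, sequence[1:]):
            probability *= flip if previous != current else 1.0 - flip
        if probability > 0.0:
            total -= probability * math.log2(probability)
    return total


def markov_subextensive_bits(n, flip):
    """Block entropy minus the extensive term n * (entropy rate)."""
    return block_entropy_bits(n, flip) - n * binary_entropy_bits(flip)
```

For flip = 0.5 the chain is purely random and the constant vanishes; for flip = 1.0 the chain is periodic and the constant is one bit, the information needed to fix the phase.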

In our view, Grassberger (1986) has made the clearest and the most appealing 
definitions. He emphasized that the slow approach of the entropy to its extensive 
limit is a sign of complexity, and has proposed several functions to analyze this 
slow approach. His effective measure complexity is the subextensive entropy term of
an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow 
to infinity. As an example, for low dimensional dynamical systems, the effec- 
tive measure complexity is finite whether the system exhibits periodic or chaotic 
behavior, but at the bifurcation point that marks the onset of chaos, it diverges 
logarithmically. More interestingly, Grassberger also notes that simulations of 
specific cellular automaton models that are capable of universal computation in- 
dicate that these systems can exhibit an even stronger, power-law, divergence. 

Grassberger (1986) also introduces the true measure complexity, which is the 
minimal information one needs to extract from the past in order to provide opti- 
mal prediction. This quantity is exactly the statistical complexity of Crutchfield 
et al., and the two approaches are actually much closer than they seem. The re- 
lation between the true and the effective measure complexities, or between the 
statistical complexity and the excess entropy, closely parallels the idea of extract- 
ing or compressing relevant information (Tishby et al. 1999, Bialek and Tishby, in 
preparation), as discussed below. 

2.5.3 A unique measure of complexity? 

We recall that entropy provides a measure of information that is unique in sat- 
isfying certain plausible constraints (Shannon 1948). It would be attractive if we 
could prove a similar uniqueness theorem for the predictive information, or any 
part of it, as a measure of the complexity or richness of a time dependent signal 
x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding with such
an argument we have to ask, however, whether we want to attach measures of 
complexity to a particular signal x(t), or whether we are interested in measures
(like the entropy itself) which constitute an average over the ensemble P[x(t)]. 

In most cases, including the learning problems discussed above, it is clear 
that we want to measure complexity of the dynamics underlying the signal, or 
equivalently the complexity of a model that might be used to describe the signal. 
This is very different from trying to define the complexity of a single realization, 
because there can be atypical data streams. Either we must treat atypicality ex- 
plicitly, arguing that atypical data streams from one source should be viewed as 
typical streams from another source, as discussed by Vitanyi and Li (2000), or we 
have to look at average quantities. Grassberger (1986) in particular has argued 
that our visual intuition about the complexity of spatial patterns is an ensemble 
concept, even if the ensemble is only implicit [see also Tong in the discussion ses- 
sion of Rissanen (1987)]. So we shall search for measures of complexity that are 
averages over the distribution P[x(t)]. 

Once we focus on average quantities, we can start by adopting Shannon's 
postulates as constraints on a measure of complexity: if there are N equally likely
signals, then the measure should be monotonic in N; if the signal is decompos-
able into statistically independent parts then the measure should be additive with
respect to this decomposition; and if the signal can be described as a leaf on a 
tree of statistically independent decisions then the measure should be a weighted 
sum of the measures at each branching point. We believe that these constraints 
are as plausible for complexity measures as for information measures, and it is 
well known from Shannon's original work that this set of constraints leaves the 
entropy as the only possibility. Since we are discussing a time dependent signal, 
this entropy depends on the duration of our sample, S(T). We know of course
that this cannot be the end of the discussion, because we need to distinguish be-
tween randomness (entropy) and complexity. The path to this distinction is to
introduce other constraints on our measure. 

First we notice that if the signal x is continuous, then the entropy is not in- 
variant under transformations of x. It seems reasonable to ask that complexity 
be a function of the process we are observing and not of the coordinate system 
in which we choose to record our observations. The examples above show us, 
however, that it is not the whole function S(T) which depends on the coordinate
system for x;^11 it is only the extensive component of the entropy that has this non-
invariance. This can be seen more generally by noting that subextensive terms in
the entropy contribute to the mutual information among different segments of 
the data stream (including the predictive information defined here), while the ex- 
tensive entropy cannot; mutual information is coordinate invariant, so all of the 
noninvariance must reside in the extensive term. Thus, any measure of complexity
that is coordinate invariant must discard the extensive component of the entropy.
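The argument can be made explicit for a sample of N points from a stationary process. As a sketch, assume an instantaneous, smooth, invertible change of coordinates $y_i = f(x_i)$; the symbols $f$ and $y_i$ are introduced here only for illustration.

```latex
\begin{align}
P_y(y_1,\ldots,y_N) &= P_x(x_1,\ldots,x_N)\prod_{i=1}^{N}\left|f'(x_i)\right|^{-1},
\qquad y_i = f(x_i),\\
S_y(N) &= S_x(N) + \sum_{i=1}^{N}\left\langle \log_2\left|f'(x_i)\right|\right\rangle
        = S_x(N) + N\left\langle \log_2\left|f'(x)\right|\right\rangle .
\end{align}
```

Under these assumptions the shift in the entropy is exactly proportional to N, so it changes only the extensive coefficient and leaves every subextensive term untouched.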

The fact that extensive entropy cannot contribute to complexity is discussed 
widely in the physics literature (Bennett 1990), as our short review above shows. 
To statisticians and computer scientists, who are used to Kolmogorov's ideas, this 
is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "use- 
ful information" in a data sequence, which is similar to splitting entropy into its 
extensive and the subextensive parts. His "model complexity," aside from not be-
ing an average as required above, is essentially equal to the subextensive entropy.
Similarly, Whittle [in the discussion of Rissanen (1987)] talks about separating the
predictive part of the data from the rest.

^11 Here we consider instantaneous transformations of x, not filtering or other transformations
that mix points at different times.

If we continue along these lines, we can think about the asymptotic expansion 
of the entropy at large T. The extensive term is the first term in this series, and 
we have seen that it must be discarded. What about the other terms? In the con- 
text of learning a parameterized model, most of the terms in this series depend 
in detail on our prior distribution in parameter space, which might seem odd for 
a measure of complexity. More generally, if we consider transformations of the 
data stream x(t) that mix points within a temporal window of size $\tau$, then for
$T \gg \tau$ the entropy S(T) may have subextensive terms which are constant, and
these are not invariant under this class of transformations. On the other hand, if 
there are divergent subextensive terms, these are invariant under such temporally 
local transformations.^12 So if we insist that measures of complexity be invariant
not only under instantaneous coordinate transformations, but also under tempo- 
rally local transformations, then we can discard both the extensive and the finite 
subextensive terms in the entropy, leaving only the divergent subextensive terms 
as a possible measure of complexity. 

^12 Throughout this discussion we assume that the signal x at one point in time is finite dimen-
sional. There are subtleties if we allow x to represent the configuration of a spatially infinite
system.

An interesting example of these arguments is provided by the statistical me-
chanics of polymers. It is conventional to make models of polymers as random
walks on a lattice, with various interactions or self avoidance constraints among
different elements of the polymer chain. If we count the number $\mathcal{N}$ of walks
with N steps, we find that $\mathcal{N}(N) \sim A N^{\gamma} z^{N}$ (de Gennes 1979). Now the entropy
is the logarithm of the number of states, and so there is an extensive entropy
$S_0 = \log_2 z$, a constant subextensive entropy $\log_2 A$, and a divergent subextensive
term $S_1(N) \to \gamma \log_2 N$. Of these three terms, only the divergent subextensive
term (related to the critical exponent $\gamma$) is universal, that is independent of the
detailed structure of the lattice. Thus, as in our general argument, it is only the
divergent subextensive terms in the entropy that are invariant to changes in our
description of the local, small scale dynamics.

We can recast the invariance arguments in a slightly different form using the 
relative entropy. We recall that entropy is defined cleanly only for discrete pro- 
cesses, and that in the continuum there are ambiguities. We would like to write 
the continuum generalization of the entropy of a process x(t) distributed accord-
ing to P[x(t)] as

$S_{\mathrm{cont}} = -\int \mathcal{D}x(t)\, P[x(t)] \log_2 P[x(t)]$, (2.100)

but this is not well defined because we are taking the logarithm of a dimensionful 
quantity. Shannon gave the solution to this problem: we use as a measure of in- 
formation the relative entropy or KL divergence between the distribution P[x(t)] 
and some reference distribution Q[x(t)],

$S_{\mathrm{rel}} = -\int \mathcal{D}x(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right)$, (2.101)

which is invariant under changes of our coordinate system on the space of sig-
nals. The cost of this invariance is that we have introduced an arbitrary dis-
tribution Q[x(t)], and so really we have a family of measures. We can find a
unique complexity measure within this family by imposing invariance principles
as above, but in this language we must make our measure invariant to different
choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal
x(t); in particular, $S_{\mathrm{rel}}$ measures the extra space needed to encode signals drawn
from the distribution P[x(t)] if we use coding strategies that are optimized for
Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of 
spelling errors will have different Qs, but to the extent that spelling errors can 
be corrected by reference to the immediate neighboring letters we insist that any 
measure of complexity be invariant to these differences in Q. On the other hand, 
readers who differ in their expectations about the global subject of the text may 
well disagree about the richness of a newspaper article. This suggests that com- 
plexity is a component of the relative entropy that is invariant under some class 
of local translations and misspellings. 
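The coding interpretation of the relative entropy can be illustrated for a discrete alphabet. The sketch below uses hypothetical function names and toy distributions; it computes the extra code length per symbol as the cross entropy minus the entropy, which is the non-negative KL divergence (the magnitude of the quantity in Eq. (2.101), whose sign convention differs).

```python
import math


def entropy_bits(p):
    """Average code length (bits per symbol) of a code optimized for p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)


def cross_entropy_bits(p, q):
    """Average code length when symbols follow p but the code is built for q.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0.0)


def extra_code_length_bits(p, q):
    """KL divergence D(p || q): the cost of holding the wrong expectations."""
    return cross_entropy_bits(p, q) - entropy_bits(p)


true_distribution = [0.5, 0.25, 0.25]
reader_expectation = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
penalty_bits = extra_code_length_bits(true_distribution, reader_expectation)
```

A reader whose expectations match the source pays no penalty, while any mismatch costs a strictly positive number of bits per symbol.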

Suppose that we leave aside global expectations, and construct our reference 
distribution Q[x(t)] by allowing only for short ranged interactions — certain letters 
tend to follow one another, letters form words, and so on, but we bound the range 
over which these rules are applied. Models of this class cannot embody the full 
structure of most interesting time series (including language), but in the present 
context we are not asking for this. On the contrary, we are looking for a measure 
that is invariant to differences in this short ranged structure. In the terminology 
of field theory or statistical mechanics, we are constructing our reference distribu-
tion Q[x(t)] from local operators. Because we are considering a one dimensional
signal (the one dimension being time), distributions constructed from local op- 
erators cannot have any phase transitions as a function of parameters; again it is 
important that the signal x at one point in time is finite dimensional. The absence 
of critical points means that the entropy of these distributions (or their contri- 
bution to the relative entropy) consists of an extensive term (proportional to the 
time window T) plus a constant subextensive term, plus terms that vanish as T 
becomes large. Thus, if we choose different reference distributions within the 
class constructible from local operators, we can change the extensive component
of the relative entropy, and we can change constant subextensive terms, but the 
divergent subextensive terms are invariant. 

To summarize, the usual constraints on information measures in the contin- 
uum produce a family of allowable complexity measures, the relative entropy 
to an arbitrary reference distribution. If we insist that all observers who choose 
reference distributions constructed from local operators arrive at the same mea- 
sure of complexity, or if we follow the first line of arguments presented above, 
then this measure must be the divergent subextensive component of the entropy 
or, equivalently, the predictive information. We have seen that this component 
is connected to learning, quantifying the amount that can be learned about dy- 
namics that generate the signal, and to measures of complexity that have arisen 
in statistics and in dynamical systems theory. 

2.6 Discussion 

We have presented predictive information as a characterization of data streams. 
In the context of learning, predictive information is related directly to general- 
ization. More generally, the structure or order in a time series or a sequence is 
related almost by definition to the fact that there is predictability along the se- 
quence. The predictive information measures the amount of such structure, but 
doesn't exhibit the structure in a concrete form. Having collected a data stream 
of duration T, what are the features of these data that carry the predictive infor-
mation $I_{\mathrm{pred}}(T)$? From Eq. (2.9) we know that most of what we have seen over
the time T must be irrelevant to the problem of prediction, so that the predic-
tive information is a small fraction of the total information; can we separate these
predictive bits from the vast amount of nonpredictive data? 

The problem of separating predictive from nonpredictive information is a spe- 
cial case of the problem discussed recently (Tishby et al. 1999, Bialek and Tishby, 
in preparation): given some data x, how do we compress our description of x 
while preserving as much information as possible about some other variable y? 
Here we identify $x = x_{\mathrm{past}}$ as the past data and $y = x_{\mathrm{future}}$ as the future.
When we compress $x_{\mathrm{past}}$ into some reduced description $\hat{x}_{\mathrm{past}}$ we keep a
certain amount of information about the past, $I(\hat{x}_{\mathrm{past}}; x_{\mathrm{past}})$, and we also
preserve a certain amount of information about the future, $I(\hat{x}_{\mathrm{past}}; x_{\mathrm{future}})$.
There is no single correct compression $x_{\mathrm{past}} \to \hat{x}_{\mathrm{past}}$; instead there is a
one parameter family of strategies which trace out an optimal curve in the plane
defined by these two mutual informations,

$I(\hat{x}_{\mathrm{past}}; x_{\mathrm{future}})$ vs. $I(\hat{x}_{\mathrm{past}}; x_{\mathrm{past}})$.

The predictive information preserved by compression must be less than the 
total, so that $I(\hat{x}_{\mathrm{past}}; x_{\mathrm{future}}) \le I_{\mathrm{pred}}(T)$. Generically no compression can preserve
all of the predictive information so that the inequality will be strict, but there are 
interesting special cases where equality can be achieved. If prediction proceeds 
by learning a model with a finite number of parameters, we might have a re- 
gression formula that specifies the best estimate of the parameters given the past 
data; using the regression formula compresses the data but preserves all of the 
predictive power. In cases like this (more generally, if there exist sufficient statis-
tics for the prediction problem) we can ask for the minimal set of $\hat{x}_{\mathrm{past}}$ such that
$I(\hat{x}_{\mathrm{past}}; x_{\mathrm{future}}) = I_{\mathrm{pred}}(T)$. The entropy of this minimal $\hat{x}_{\mathrm{past}}$ is the true measure
complexity defined by Grassberger (1986) or the statistical complexity defined by 
Crutchfield and Young (1989) [in the framework of the causal states theory a very
similar comment was made recently by Shalizi and Crutchfield (2000)]. 
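The case of equality can be checked on a toy problem. The sketch below is our own illustration with hypothetical function names: it takes a coin with a uniform prior on its bias, treats n flips as the past and the next flip as the future, and compares the information about the future kept by three descriptions of the past: the full sequence, the number of heads (a sufficient statistic), and the first flip alone.

```python
import itertools
import math
from collections import defaultdict


def mutual_information_bits(joint):
    """Mutual information of a joint distribution given as {(a, b): prob}."""
    marginal_a = defaultdict(float)
    marginal_b = defaultdict(float)
    for (a, b), p in joint.items():
        marginal_a[a] += p
        marginal_b[b] += p
    return sum(
        p * math.log2(p / (marginal_a[a] * marginal_b[b]))
        for (a, b), p in joint.items()
        if p > 0.0
    )


def sequence_probability(heads, tails):
    """Integral of theta^heads * (1 - theta)^tails over a uniform prior."""
    return (
        math.factorial(heads) * math.factorial(tails)
        / math.factorial(heads + tails + 1)
    )


def compressed_joint(n, compress):
    """Joint distribution of (compress(past), future) for n past flips."""
    joint = defaultdict(float)
    for past in itertools.product((0, 1), repeat=n):
        for future in (0, 1):
            heads = sum(past) + future
            joint[(compress(past), future)] += sequence_probability(
                heads, n + 1 - heads
            )
    return joint


info_full = mutual_information_bits(compressed_joint(6, lambda past: past))
info_count = mutual_information_bits(compressed_joint(6, sum))
info_first = mutual_information_bits(compressed_joint(6, lambda past: past[0]))
```

In this example the head count keeps all of the predictive information while using far fewer states than the full sequence, whereas keeping only the first flip discards part of it.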

In the context of statistical mechanics, long range correlations are charac- 
terized by computing the correlation functions of order parameters, which are 
coarse-grained functions of the system's microscopic variables. When we know 
something about the nature of the order parameter (e.g., whether it is a vector
or a scalar), then general principles allow a fairly complete classification and de- 
scription of long range ordering and the nature of the critical points at which 
this order can appear or change. On the other hand, defining the order param- 
eter itself remains something of an art. For a ferromagnet, the order parameter 
is obtained by local averaging of the microscopic spins, while for an antiferro- 
magnet one must average the staggered magnetization to capture the fact that 
the ordering involves an alternation from site to site, and so on. Since the order 
parameter carries all the information that contributes to long range correlations 
in space and time, it might be possible to define order parameters more generally 
as those variables that provide the most efficient compression of the predictive 
information, and this should be especially interesting for complex or disordered 
systems where the nature of the order is not obvious intuitively; a first try in this 
direction was made by Bruder (1998). At critical points the predictive information 
will diverge with the size of the system, and the coefficients of these divergences 
should be related to the standard scaling dimensions of the order parameters, but 
the details of this connection need to be worked out. 

If we compress or extract the predictive information from a time series we are 
in effect discovering "features" that capture the nature of the ordering in time. 
Learning itself can be seen as an example of this, where we discover the parame-
ters of an underlying model by trying to compress the information that one sam-
ple of N points provides about the next, and in this way we address directly the
problem of generalization (Bialek and Tishby, in preparation). The fact that (as 
mentioned above) nonpredictive information is useless to the organism suggests 
that one crucial goal of neural information processing is to separate predictive in- 
formation from the background. Perhaps rather than providing an efficient rep- 
resentation of the current state of the world — as suggested by Attneave (1954), 
Barlow (1959, 1961), and others (Atick 1992) — the nervous system provides an ef-
ficient representation of the predictive information.^13 It should be possible to test
this directly by studying the encoding of reasonably natural signals and asking if 
the information which neural responses provide about the future of the input is 
close to the limit set by the statistics of the input itself, given that the neuron only 
captures a certain number of bits about the past. Thus we might ask if, under 
natural stimulus conditions, a motion sensitive visual neuron captures features
of the motion trajectory that allow for optimal prediction or extrapolation of that
trajectory; by using information theoretic measures we both test the "efficient
representation" hypothesis directly and avoid arbitrary assumptions about the
metric for errors in prediction. For more complex signals such as communication
sounds, even identifying the features that capture the predictive information is
an interesting problem.

^13 If, as seems likely, the stream of data reaching our senses has diverging predictive information
then the space required to write down our description grows and grows as we observe the world
for longer periods of time. In particular, if we can observe for a very long time then the amount
that we know about the future will exceed, by an arbitrarily large factor, the amount that we
know about the present. Thus representing the predictive information may require many more
neurons than would be required to represent the current data. If we imagine that the goal of
primary sensory cortex is to represent the current state of the sensory world, then it is difficult
to understand why these cortices have so many more neurons than they have sensory inputs. In
the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly
30,000 neurons for each photoreceptor cell in the retina (Hawken and Parker 1991); although
much remains to be learned about these cells, it is difficult to imagine that the activity of so many
neurons constitutes an efficient representation of the current sensory inputs. But if we live in a
world where the predictive information in the movies reaching our retina diverges, it is perfectly
possible that an efficient representation of the predictive information available to us at one instant
requires thousands of times more space than an efficient representation of the image currently
falling on our retina.

It is natural to ask if these ideas about predictive information could be used to 
analyze experiments on learning in animals or humans. We have emphasized the 
problem of learning probability distributions or probabilistic models rather than 
learning deterministic functions, associations or rules. It is known that the ner- 
vous system adapts to the statistics of its inputs, and this adaptation is evident 
in the responses of single neurons (Smirnakis et al. 1996, Brenner et al. 2000);
these experiments provide a simple example of the system learning a parameter- 
ized distribution. When making saccadic eye movements, human subjects alter 
their distribution of reaction times in relation to the relative probabilities of dif- 
ferent targets, as if they had learned an estimate of the relevant likelihood ratios 
(Carpenter and Williams 1995). Humans also can learn to discriminate almost 
optimally between random sequences (fair coin tosses) and sequences that are 
correlated or anticorrelated according to a Markov process; this learning can be 
accomplished from examples alone, with no other feedback (Lopes and Oden 
1987). Acquisition of language may require learning the joint distribution of suc- 
cessive phonemes, syllables, or words, and there is direct evidence for learning 
of conditional probabilities from artificial sound sequences, both by infants and 
by adults (Saffran et al. 1996; 1999). These examples, which are not exhaustive, 
indicate that the nervous system can learn an appropriate probabilistic model,^*
and this offers the opportunity to analyze the dynamics of this learning using in-
formation theoretic methods: What is the entropy of N successive reaction times
following a switch to a new set of relative probabilities in the saccade experi-
ment? How much information does a single reaction time provide about the
relevant probabilities? Following the arguments above, such analysis could lead
to a measurement of the universal learning curve $\Lambda(N)$.

^* As emphasized above, many other learning problems, including learning a function from
noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this
description applies to a much wider range of biological learning tasks.

The learning curve $\Lambda(N)$ exhibited by a human observer is limited by the pre-
dictive information in the time series of stimulus trials itself. Comparing $\Lambda(N)$ to
this limit defines an efficiency of learning in the spirit of the discussion by Barlow
(1983); while it is known that the nervous system can make efficient use of avail-
able information in signal processing tasks [cf. Chapter 4 of Rieke et al. (1997)], it
is not known whether the brain is an efficient learning machine in the analogous
sense. Given our classification of learning tasks by their complexity, it would be
natural to ask if the efficiency of learning were a critical function of task complex-
ity: perhaps we can even identify a limit beyond which efficient learning fails,
indicating a limit to the complexity of the internal model used by the brain dur-
ing a class of learning tasks. We believe that our theoretical discussion here at
least frames a clear question about the complexity of internal models, even if for
the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or
learning problems beyond the class of finitely parameterizable models, that is the
class with power-law divergent predictive information. Qualitatively this class is
more complex than any parametric model, no matter how many parameters there
may be, because of the more rapid asymptotic growth of $I_{\rm pred}(N)$. On the other
hand, with a finite number of observations N, the actual amount of predictive in-
formation in such a nonparametric problem may be smaller than in a model with
a large but finite number of parameters. Specifically, if we have two models, one
with $I_{\rm pred}(N) \sim A N^{\nu}$ and one with K parameters so that $I_{\rm pred}(N) \sim (K/2) \log_2 N$,
the infinite parameter model has less predictive information for all N smaller
than some critical value

$$N_c \sim \left[ \frac{K}{2A\nu} \log_2 \left( \frac{K}{2A} \right) \right]^{1/\nu} . \tag{2.102}$$



In the regime $N \ll N_c$, it is possible to achieve more efficient prediction by trying
to learn the (asymptotically) more complex model, as we illustrate concretely in
numerical simulations of the density estimation problem in Chapter 3. Even if
there are a finite number of parameters, such as the finite number of synapses in
a small volume of the brain, this number may be so large that we always have
$N \ll N_c$, so that it may be more effective to think of the many parameter model
as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime $N \ll N_c$ is the relevant one for much
of biology. If we consider, for example, 10 mm$^2$ of inferotemporal cortex devoted
to object recognition (Logothetis and Sheinberg 1996), the number of synapses is
$K \sim 5 \times 10^{9}$. On the other hand, object recognition depends on foveation, and
we move our eyes roughly three times per second throughout perhaps 15 years
of waking life during which we master the art of object recognition. This limits
us to at most $N \sim 10^{9}$ examples. Remembering that we must have $\nu < 1$, even
with large values of A, Eq. (2.102) suggests that we operate with $N < N_c$. One
can make similar arguments about very different brains, such as the mushroom
bodies in insects (Capaldi, Robinson and Fahrbach 1999). If this identification
of biological learning with the regime $N \ll N_c$ is correct, then the success of
learning in animals must depend on strategies that implement sensible priors
over the space of possible models.
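
The comparison between N and $N_c$ made above can be checked with a few lines of arithmetic. The sketch below evaluates Eq. (2.102) as written and counts fixations at three per second over 15 years. The 16 waking hours per day and the values tried for A and $\nu$ are illustrative assumptions, not numbers from the text; only K and the fixation rate are taken from the paragraph above.

```python
import math

def critical_sample_size(K, A, nu):
    """N_c from Eq. (2.102): [ K/(2 A nu) * log2(K/(2A)) ] ** (1/nu)."""
    if not (0.0 < nu < 1.0):
        raise ValueError("nu must lie in (0, 1)")
    if K <= 2.0 * A:
        raise ValueError("need K > 2A so that the logarithm is positive")
    return ((K / (2.0 * A * nu)) * math.log2(K / (2.0 * A))) ** (1.0 / nu)

# Number of fixations: three per second over 15 years of waking life.
WAKING_HOURS_PER_DAY = 16  # assumption, not stated in the text
n_examples = 3 * 3600 * WAKING_HOURS_PER_DAY * 365 * 15

K_SYNAPSES = 5e9  # synapse count quoted in the text
for A in (1.0, 100.0):           # hypothetical prefactors
    for nu in (0.3, 0.5, 0.9):   # hypothetical exponents
        n_c = critical_sample_size(K_SYNAPSES, A, nu)
        print(f"A={A:6.1f} nu={nu:.1f}  N_c={n_c:.2e}  N/N_c={n_examples / n_c:.1e}")
```

For every combination tried here the ratio $N/N_c$ stays below one, which is the sense in which the text argues that biological learning operates at $N < N_c$.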

There is one clear empirical hint that humans can make effective use of models 
that are beyond finite parameterization (in the sense that predictive information 
diverges as a power-law), and this comes from studies of language. Long ago,
Shannon (1951) used the knowledge of native speakers to place bounds on the
entropy of written English, and his strategy made explicit use of predictability.
Shannon showed N-letter sequences to native speakers (readers), asked them to
guess the next letter, and recorded how many guesses were required before they
got the right answer. Thus each letter in the text is turned into a number, and the
entropy of the distribution of these numbers is an upper bound on the conditional
entropy $\ell(N)$ [cf. Eq. (2.10)]. Shannon himself thought that the convergence as
N becomes large was rather quick, and quoted an estimate of the extensive en-
tropy per letter $S_0$. Many years later, Hilberg (1990) reanalyzed Shannon's data
and found that the approach to extensivity in fact was very slow: certainly there
is evidence for a large component $S_1(N) \propto N^{1/2}$, and this may even dominate
the extensive component for accessible N. Ebeling and Pöschel (1994; see also
Pöschel, Ebeling, and Rose 1995) studied the statistics of letter sequences in long
texts (like Moby Dick) and found the same strong subextensive component. It
would be attractive to repeat Shannon's experiments with a slightly different de-
sign that emphasizes the detection of subextensive terms at large N.^†

^† Associated with the slow approach to extensivity is a large mutual information between
words or characters separated by long distances, and several groups have found that this mu-
tual information declines as a power law. Cover and King (1978) criticize such observations by
noting that such behavior is impossible in Markov chains of arbitrary order. While it is possi-
ble that existing mutual information data have not reached asymptotia, the criticism of Cover
and King misses the possibility that language is not a Markov process. Of course it cannot be
Markovian if it has a power-law divergence in the predictive information.

In summary, we believe that our analysis of predictive information solves the
problem of measuring the complexity of time series. This analysis unifies ideas
from learning theory, coding theory, dynamical systems, and statistical mechan-
ics. In particular we have focused attention on a class of processes that are qual-
itatively more complex than those treated in conventional learning theory, and
there are several reasons to think that this class includes many examples of rele-
vance to biology and cognition.



Chapter 3 

Learning continuous distributions:
Simulations with field theoretic priors



3.1 Occam factors in statistics 

As we have discussed extensively in the preceding Chapter, one of the central 
problems in learning is to balance 'goodness of fit' criteria against the complexity 
of models. An important development in the Bayesian approach thus was the re- 
alization that there does not need to be any extra penalty for model complexity: 
if we compute the total probability that data are generated by a model, there is a 
factor from the volume in parameter space — the "Occam factor" — that discrimi- 
nates against more complex models (MacKay 1992, Balasubramanian 1997). This 
works remarkably well for systems with a finite number of parameters and cre- 
ates a complexity "razor" (named after "Occam's razor") that is equivalent to the 
model complexity of the celebrated Minimum Description Length (MDL) principle
(Rissanen 1989, 1996). It is not clear, however, what happens if we leave this finite
dimensional setting and consider nonparametric problems such as the estimation
of a smooth probability density. 

As we have emphasized, the behavior of the predictive information, Eq. (|2.^), 
is controlled by the density of models, and therefore the predictive information 
is closely related to the Occam factor. Since the density and consequently $I_{\rm pred}$
are well defined for the finite parameter as well as for the nonparametric cases
(cf. Sections 2.4.2 and 2.4.6), one can hope that a nonparametric analogue of the
Occam factor exists and can do its job of punishing complexity. However, since
in these two cases the densities of models are very different, the Occam factor 
details certainly must be different too. 

The 1996 formulation of nonparametric learning by Bialek, Callan, and Strong,
which we have summarized in Appendix A.1 and investigated further in Sec-
tion 2.4.7, may serve as a good example in which to study infinite dimensional
Occam factors. In this Bayesian quantum field theory formulation, standard field
theory methods may be used not only to find a nowhere singular estimate of a
continuous density, but also to compute the infinite dimensional analog of the
Occam factor, at least asymptotically for large numbers of samples. This factor,
which we also call the fluctuation determinant, is the second term of the effective
Hamiltonian, Eqs. (A.7, 2.92),

$$R = \frac{1}{2} \left( \frac{N}{l\, l_0} \right)^{1/2} \int dx\, \exp\left[ -\phi_{\rm cl}(x)/2 \right] . \tag{3.1}$$

Intuitively, smaller values of $l$ allow more rapidly varying and thus more com-
plex [as measured by the predictive information, Eqs. (2.97, 2.98)] estimates of the
density. Correspondingly, the infinite dimensional Occam factor is bigger and
thus exponentially punishes more complex models. As Bialek et al. have spec-
ulated, $l$, the only free parameter of their theory, can be determined by a fight
between the log-likelihood goodness of fit and the Occam factor to provide for
the shortest total description (highest probability) of the data, much like in the
finite parameter MDL theory. However, their proposed scaling for $l^*$ (the best
value of the parameter) as a function of N, $l^* \sim N^{1/3}$, seems to be over-universal
and requires further analysis.

There are more questions not clearly answered either by the original work, 
or its further developments (Periwal 1997, 1998, Holy 1997, Aida 1999). Can this 
method be implemented in practice? Can we really use the infinite dimensional 
Occam factor to balance against the goodness of fit? How does the algorithm's 
performance compare to the absolute bounds set by the predictive information? 
What happens if the learning problem is strongly atypical of the prior distribu- 
tion? And what is the role of the Occam factor in this case? 

To answer all of these questions we turn to numerical simulations. 

3.2 The algorithm 

To simplify the algorithm, maximize the speed of simulations, and shorten our 
presentation, we do the numerical analysis only in the framework of the original 
paper (Bialek, Callan, and Strong 1996, see also Appendix A.1 and Section 2.4.7).



This may seem too specific, but we believe that our results are very general and 
will hold for the alternative formulations of Periwal (1997, 1998) and Holy (1997) 
since the mechanisms of regularization and complexity control are everywhere 
the same.

Due to our most recent developments (cf. Section 2.4.7) and the specific ques-
tions we ask, we need to modify the original setup slightly before proceeding.
First of all, we will investigate the performance of the method in many different
learning problems, some of them not characteristic of the prior, Eq. (A.5). For
these purposes we will take densities at random from an 'actual' a priori distri-
bution that minimally generalizes Eq. (A.5),

$$\mathcal{P}[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\left[ -\frac{l_a^{2\eta_a - 1}}{2} \int dx \left( \frac{\partial^{\eta_a} \phi}{\partial x^{\eta_a}} \right)^2 \right] \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right] . \tag{3.2}$$

Here $\eta_a > 1/2$ to ensure UV convergence, $\mathcal{Z}$ is the normalization constant, and
the $\delta$-function enforces normalization of Q. We refer to $l_a$ and $\eta_a$ as the smoothness
scale and the exponent, respectively, and the subscript a stands for 'actual'. We will
use non-subscripted $\eta$ and $l$ to indicate the parameters the algorithm uses, that
is, the learning machine's own a priori expectations, and then $\eta_a = \eta = 1$ and
$l_a = l$ reduces to the original formulation of Bialek et al.

The other modification we make relates to the problem of the infrared diver-
gence of the predictive information, Eq. (2.97), or, equivalently, to the nonuniform
convergence of the estimate $Q_{\rm est}(x)$ to the target $P(x)$, Eq. (A.11). To cure this we
can put the system in a box of size L, just like we did in Section 2.4.7. Also,
we realize that the variance of fluctuations between the target and the estimate
(Bialek et al. 1996) is just an ad hoc measure of performance of the learning ma-
chine. The universal learning curve $\Lambda(N)$, Eq. (2.13), is a much better choice. For a
proper Bayesian learning with the prior, Eq. (A.5), using Eqs. (2.51, 2.98), we write

$$\Lambda(N) \equiv \left\langle \left\langle D_{\rm KL}\left[ P(x) \| Q_{\rm est}(x) \right] \right\rangle_{\{x_i\}} \right\rangle^{(0)} \sim \frac{1}{2} \sqrt{\frac{L}{lN}} , \tag{3.3}$$

where $\langle \cdots \rangle$ means an average over the prior, and the log 2 factor is omitted
because we choose to measure entropies in nats (that is, use natural logarithms)
in this Chapter. Note that the coefficient in front of the square root is probably
meaningless since it is calculated here only to the zeroth order (see Section 2.4.7).
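
As a concrete reading of the learning curve defined above, the following minimal sketch computes the KL divergence in nats between a target and an estimate tabulated on a uniform grid, and averages it over several (target, estimate) pairs. The function names and the toy densities are illustrative assumptions; this is not the code used for the simulations reported here.

```python
import numpy as np

def kl_divergence_nats(p, q, dx):
    """D_KL[P||Q] = integral of P log(P/Q) dx, in nats, on a uniform grid."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # grid points with P = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx)

def learning_curve_point(targets, estimates, dx):
    """Average KL divergence over (target, estimate) pairs at one sample size N."""
    return float(np.mean([kl_divergence_nats(p, q, dx)
                          for p, q in zip(targets, estimates)]))

# Toy check: a modulated target against a uniform estimate on [0, 1).
x = np.linspace(0.0, 1.0, 1000, endpoint=False)
dx = x[1] - x[0]
target = 1.0 + 0.5 * np.cos(2 * np.pi * x)
uniform = np.ones_like(x)
print(kl_divergence_nats(target, uniform, dx))
```

Using natural logarithms matches the convention stated above of measuring entropies in nats.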
After these modifications, the algorithm to implement the theory is rather sim-
ple. We need to solve the second order differential equation [cf. Eq. (A.8)]

$$l\, \partial_x^2 \phi_{\rm cl}(x) + \frac{N}{l_0} \exp\left[ -\phi_{\rm cl}(x) \right] = \sum_{j=1}^{N} \delta(x - x_j) . \tag{3.4}$$

Normalization of $Q_{\rm cl}$ fixes one integration constant, and the other is set by a peri-
odicity constraint for $\phi_{\rm cl}$,

$$\phi_{\rm cl}(x) = \phi_{\rm cl}(x + L) , \tag{3.5}$$

which is due to x being in a box. The resulting boundary value problem is solved
by a standard 'relaxation' (or Newton) method of iterative improvements to a
guessed solution (see, for example, Press et al. 1988). The target precision is al-
ways $10^{-5}$, which is smaller than the smallest $D_{\rm KL} \sim 10^{-4}$ we intend to investi-
gate. It turns out that the method converges regardless of the initial guess for all
$l$ up to ~ 5. However, convergence is not uniform in $l$ and, as $l \to 0$, the number
of iterations required to reach the same precision grows almost quadratically in
$1/l$. The independent variable $x \in [0, L]$ is discretized in equal steps to ensure
stability of the method. We expect the estimate distribution to vary over a local 
length scale [Bialek et al. 1996, cf. Eq. ( |70D ] 

$$\xi(x) \sim \left[ l / N Q_{\rm cl}(x) \right]^{1/2} \approx \left[ l / N P(x) \right]^{1/2} . \tag{3.6}$$

Empirically we see that, for small $l$, the maximal value of the target distribu-
tion P(x) grows approximately as $\sim l^{-1/2}$. This means that for Figs. (3.1-3.4),
where $N \le 10^{5}$ and $l \ge 0.05$, we are safe with $10^{4}$ grid points. Similarly, for
Figs. (3.5, 3.6), we need $10^{5}$ discretization steps because $N = 10^{6}$ and $l = 0.01$ are
present there.

Since the prior we use, Eq. (3.2), is UV convergent, we can generate random
probability densities from it by replacing $\phi$ with its Fourier series and truncating
the latter at some sufficiently high wavenumber $k_c$,

$$\phi(x) = \sum_{k=0}^{k_c} \left[ A_k \cos \frac{2\pi k x}{L} + B_k \sin \frac{2\pi k x}{L} \right] . \tag{3.7}$$

Then Eq. (3.2) enforces the amplitudes of the k'th mode to be distributed normally
around zero with the standard deviation

$$\sigma(A_k) = \sigma(B_k) = 2^{1/2} \left( \frac{L}{l_a} \right)^{\eta_a - 1/2} \frac{1}{(2\pi k)^{\eta_a}} . \tag{3.8}$$

In addition, the amplitude of the zeroth mode, $A_0$, is always set by the normal-
ization constraint. For the same sets of figures, it is enough to have $k_c = 1000$ and
5000 respectively, and then we should see very little effect associated with the
finiteness of $k_c$.
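
The sampling procedure just described can be sketched as follows. This is an illustrative implementation, not the original simulation code. It draws Gaussian Fourier amplitudes whose variance falls off as $(L/l_a)^{2\eta_a - 1} k^{-2\eta_a}$, with the numerical prefactor $2/(2\pi)^{2\eta_a}$ obtained by inserting the Fourier series into the Gaussian part of the prior, and it fixes the zeroth mode by normalizing the density numerically on a grid. The function name, grid size M, and default arguments are assumptions.

```python
import numpy as np

def sample_density(l_a=0.2, eta_a=1.0, L=1.0, k_c=200, M=2048, rng=None):
    """Draw one random density P(x) ~ exp(-phi(x)) on a periodic grid of M points.

    phi is a truncated Fourier series with independent Gaussian amplitudes.
    """
    if M < 2 * k_c:
        raise ValueError("grid too coarse to resolve the highest Fourier mode")
    rng = np.random.default_rng() if rng is None else rng
    x = np.arange(M) * (L / M)
    k = np.arange(1, k_c + 1)
    # Standard deviation of the k-th mode amplitudes under the Gaussian prior.
    sigma = np.sqrt(2.0) * (L / l_a) ** (eta_a - 0.5) / (2.0 * np.pi * k) ** eta_a
    A = rng.normal(0.0, sigma)
    B = rng.normal(0.0, sigma)
    phase = 2.0 * np.pi * np.outer(k, x) / L          # shape (k_c, M)
    phi = A @ np.cos(phase) + B @ np.sin(phase)
    # The zeroth mode only shifts phi by a constant; fix it by normalization.
    P = np.exp(-phi)
    P /= P.sum() * (L / M)
    return x, P

x, P = sample_density(rng=np.random.default_rng(0))
print("integral of P:", P.sum() * (x[1] - x[0]), " range:", P.min(), P.max())
```

Smaller `l_a` or `eta_a` puts more weight on high wavenumbers and therefore gives rougher targets, which is how the 'wrong prior' experiments later in this Chapter generate atypical distributions.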

Coded in such a way, the simulations are extremely computationally inten- 
sive. Therefore, the Monte Carlo averages given here are only over 500 runs, 
and fluctuation determinants are calculated according to Eq. (^7\^ , but not using 
numerical path integration. 
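
To make the algorithm of this Section concrete, here is a minimal sketch of a Newton-type solver for the boundary value problem of Eqs. (3.4, 3.5). It is written under several assumptions and is not the code used for the results reported here: the grid is small and dense linear algebra is used, each delta function is smeared over one grid cell, and the Newton step is capped at 0.5 per iteration as a simple damping rule. Integrating Eq. (3.4) over one period shows that a periodic solution is automatically normalized, so the sketch does not impose normalization separately.

```python
import numpy as np

def solve_classical_density(samples, l=0.2, L=1.0, l0=1.0, M=200,
                            tol=1e-10, max_iter=500):
    """Damped Newton iteration for Eq. (3.4) on a periodic grid of M points.

    Returns (Q, dx), where Q = exp(-phi) / l0 is the density estimate on the grid.
    """
    samples = np.mod(np.asarray(samples, dtype=float), L)
    N = samples.size
    dx = L / M
    # Delta functions smeared over one grid cell, so that rho integrates to N.
    counts, _ = np.histogram(samples, bins=M, range=(0.0, L))
    rho = counts / dx
    # Periodic second-difference operator.
    idx = np.arange(M)
    D2 = np.zeros((M, M))
    D2[idx, idx] = -2.0
    D2[idx, (idx + 1) % M] = 1.0
    D2[idx, (idx - 1) % M] = 1.0
    D2 /= dx ** 2
    phi = np.full(M, np.log(L / l0))        # start from the uniform density
    for _ in range(max_iter):
        boltz = (N / l0) * np.exp(-phi)
        residual = l * (D2 @ phi) + boltz - rho
        jacobian = l * D2 - np.diag(boltz)
        step = np.linalg.solve(jacobian, -residual)
        biggest = np.max(np.abs(step))
        if biggest > 0.5:                   # damping: cap the change in phi per iteration
            step *= 0.5 / biggest
        phi += step
        if biggest < tol:
            break
    return np.exp(-phi) / l0, dx

rng = np.random.default_rng(1)
data = rng.random(100)                      # 100 samples from a uniform target (toy data)
Q, dx = solve_classical_density(data)
print("normalization:", Q.sum() * dx, " min Q:", Q.min())
```

The Jacobian here is the periodic Laplacian minus a positive diagonal, so it is negative definite and every Newton step is well defined. For grids as large as those quoted above, a sparse or banded solver would be the natural replacement for the dense `np.linalg.solve` call.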

3.3 Learning with the correct prior 



As an example of the algorithm's performance, Fig. (3.1) shows one particular
learning run for $\eta = \eta_a = 1$ and $l = l_a = 0.2$. We see that singularities and
overfitting are absent even for N as low as 10. Moreover, the approach of $Q_{\rm cl}(x)$
to the actual distribution P(x) is remarkably fast: for N = 10, they are similar;
for N = 1000, very close; for N = 100000, one needs to look carefully to see the
difference between the two.

Figure 3.1: $Q_{\rm cl}$ found for different N at $l = 0.2$.



The next question on our list is the behavior of the learning curves. For the
same $\eta$ and $l = 0.4, 0.2, 0.05$, these are shown in Fig. (3.2). One sees that the
exponents are extremely close to the expected 1/2, and the ratios of the prefactors
are within the errors from the predicted scaling $\sim 1/\sqrt{l}$. All of this means that
the proposed algorithm for finding densities not only works as predicted, but
is, at most, a constant factor away from being optimal in using the predictive
information of the sample set.
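
Exponents and prefactors of the kind discussed above can be extracted by a straight-line fit in log-log coordinates. The sketch below shows one way to do this with an ordinary least-squares fit. The synthetic data are made up for illustration, and the fitting procedure actually used for the Figures is not specified in the text.

```python
import numpy as np

def fit_power_law(N, Lam):
    """Least-squares fit of log(Lam) = log(a) - b * log(N); returns (a, b)."""
    slope, intercept = np.polyfit(np.log(N), np.log(Lam), 1)
    return float(np.exp(intercept)), float(-slope)

N = np.array([10, 100, 1000, 10000, 100000], dtype=float)
rng = np.random.default_rng(0)
# Synthetic learning curve: 0.83 * N**(-0.5) with 5% multiplicative noise (made-up data).
Lam = 0.83 * N ** -0.5 * np.exp(0.05 * rng.standard_normal(N.size))
a, b = fit_power_law(N, Lam)
print(f"Lambda ~ {a:.2f} * N^(-{b:.3f})")
```

With fitted prefactors for two smoothness scales $l_1$ and $l_2$, the predicted $1/\sqrt{l}$ scaling can be checked by comparing their ratio with $\sqrt{l_2/l_1}$.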

Note also that the data points approach their asymptotic regimes very differ-
ently for different values of $l$: the bigger $l$ is, the lower the data start compared
to their respective fits. This is explainable in view of the fact that smoother dis-
tributions usually vary over smaller ranges. For example, for $l = 0.4$ the target
distribution P(x) usually takes values from ~ 0.5 to ~ 2. On the other hand, for
not too large N the estimates are also smooth and close to being uniform. There-
fore, the KL divergence usually comes out small in this case. Thus the $l = 0.4$ data
start so low not because we manage to learn the distribution extremely well for
N = 10, but because almost any guess is as good as any other at this level of
detail.

Figure 3.2: $\Lambda$ as a function of N and $l$. The best fits are: for $l = 0.4$, $\Lambda = (0.54 \pm 0.07) N^{-0.483 \pm 0.014}$; for $l = 0.2$, $\Lambda = (0.83 \pm 0.08) N^{-0.493 \pm 0.09}$; for $l = 0.05$, $\Lambda = (1.64 \pm 0.16) N^{-0.507 \pm 0.09}$.



3.4 Learning with 'wrong' priors 

We stress first that there is no such thing as a wrong prior. If one admits the 
possibility of a prior being wrong, then that prior does not encode all of our 
a priori knowledge! It does make sense, however, to ask what happens if the 
distribution we are trying to learn is an extreme outlier in the prior $\mathcal{P}[\phi(x)]$. One
way to generate such an example is to choose a typical function from a different
prior $\mathcal{P}'[\phi(x)]$, and this is what we mean by 'learning with an incorrect prior.' To
study this we learn using $\eta = 1$ and $l$, but we choose our target distributions from
the prior, Eq. (3.2), with different values of $\eta_a$ and $l_a$.

Figure 3.3: $\Lambda$ as a function of N and $l_a$. Best fits are: for $l_a = 0.4$, $\Lambda = (0.56 \pm 0.08) N^{-0.477 \pm 0.015}$; for $l_a = 0.05$, $\Lambda = (1.90 \pm 0.16) N^{-0.502 \pm 0.008}$; for variable $l_a$, $\Lambda = (1.28 \pm 0.13) N^{-0.495 \pm 0.0??}$. In all cases we learn with $l = 0.2$.

If the prior is wrong in the above sense, and the learning process is as usual,
Eqs. (A.3, A.6-A.8), then we still expect the asymptotic behavior, Eq. (3.3), to hold.
Indeed, once $\phi_{\rm cl}$ becomes close to $-\log P$, it takes the same time to discover that
the distribution's features at the current relevant scale $\xi(N)$ are as expected, too
big, or almost absent. Thus only the prefactors of $\Lambda$ should change, and those
must increase because there is an obvious advantage in having the right prior.
We illustrate these points in Figs. (3.3, 3.4).

Figure (3.3) shows the learning curve for distributions generated with the 'ac-
tual' smoothness scale $l_a = 0.4, 0.05$ and studied using the 'learning' smoothness
scale $l = 0.2$ (we show the case $l_a = l = 0.2$ again as a reference). The $\Lambda \sim 1/\sqrt{N}$
behavior is seen unmistakably. The prefactors are a bit larger (unfortunately, in-
significantly) than the corresponding ones from Fig. (3.2), so we may expect that
the 'right' $l$, indeed, provides better learning (see Section 3.5 for a detailed dis-
cussion). Finally, the approach to the asymptotes again is different for the differ-
ent examples considered, but it is still explainable by the argument we used for
Fig. (3.2).

To generate outliers that are even more uncommon than the ones just dis-
cussed one may want to distort the x axis (use a different parameterization), and
this results in a variable smoothness scale $l_a(x)$. As an example, Fig. (3.3) shows
the learning curve for $l_a = 0.2$ distributions that have been rescaled according
to $x \to x - 0.9 \sin(2\pi x/L)$. For the rescaled variable, $l_a(x)$ varies from 0.02 to
0.38. Two separate straight line fits, through the first five (shown) and the last
four points, are possible for this data. Each of the fits separately agrees with the
prediction, but their prefactors are different. This is probably just a numerical
artifact because 1000 Fourier modes used here feel like much less in some places
due to the rescaling, and this shows up at large enough N. Alternatively, this
may be an indication that a detailed analysis of the reparameterization invariant
formulation (Periwal 1997, 1998) is needed.

Finally, Fig. (3.4) illustrates learning when not only $l$, but also $\eta$ is 'wrong' in
the sense defined above. We illustrate this for $\eta_a = 2, 0.8, 0.6, 0$ (remember that
only $\eta_a > 0.5$ removes UV divergences). Again, the inverse square root decay
of $\Lambda$ should be observed, and this is evident for $\eta_a = 2$. The $\eta_a = 0.8, 0.6, 0$
cases are different: even for N as high as $10^{5}$ the estimate of the distribution is
far from the target, thus the asymptotic regime is not reached. This is a crucial
observation for our subsequent analysis of the smoothness scale determination
from the data (Section 3.5). Remarkably, $\Lambda$ (both averaged and in the single runs
shown) is monotonic, so even in the cases of qualitatively less smooth distributions
there is still no overfitting. On the other hand, $\Lambda$ is well above the asymptote for
$\eta_a = 2$ and small N, which means that initially too many details are expected and
wrongfully introduced into the estimate, but then they are almost immediately
($N \sim 300$) eliminated by the data.

Figure 3.4: $\Lambda$ as a function of N, $\eta_a$ and $l_a$. Best fits: for $\eta_a = 2$, $l_a = 0.1$, $\Lambda = (0.40 \pm 0.05) N^{-0.493 \pm 0.013}$; for $\eta_a = 0.8$, $l_a = 0.1$, $\Lambda = (1.06 \pm 0.08) N^{-0.355 \pm 0.008}$. $l = 0.2$ for all graphs, but the one with $\eta_a = 0$, for which $l = 0.1$.

3.5 Selecting the smoothness scale

Results presented in the last Figures already suggest that Occam factors should
work in this infinite dimensional case, and that it indeed is possible to see this in
numerical simulations. The competition between the data and the Occam factor
is equivalent to minimizing the expression [cf. Eqs. (A.6, A.7)]

$$H^*[\phi_{\rm cl}; \{x_i\}; l] = \int dx\, \frac{l}{2} \left( \partial_x \phi_{\rm cl} \right)^2 + \sum_{i=1}^{N} \phi_{\rm cl}(x_i) + \frac{1}{2} \sqrt{\frac{N}{l\, l_0}} \int dx\, \exp\left[ -\phi_{\rm cl}(x)/2 \right] , \tag{3.9}$$



which makes the total probability of the data maximal, and thus the length needed
to code it minimal. How does the smoothness scale $l^*$ that minimizes $H^*$
behave? The data term [second in Eq. (3.9)] on average is equal to $N D_{\rm KL}(P \| Q_{\rm cl})$,
and it can be small compared to the other terms for very regular P(x). Then only
the kinetic and the fluctuation terms matter, and $l^* \sim N^{1/3}$, as was obtained by
Bialek, Callan, and Strong (1996). For less regular distributions P(x) [cf. graphs
for $\eta_a = 0, 0.6, 0.8$ in Fig. (3.4)], this is not true. Indeed, for $\eta = 1$, $Q_{\rm cl}(x)$ approx-
imates large scale features of P(x) very well, but details at scales smaller than
$\sqrt{l/NL}$ are not present in it. If P(x) is taken from the prior, Eq. (3.2), charac-
terized by some $\eta_a$ and $l_a$, then according to Eq. (3.8) the contribution of these
details falls off with the wave number k as $\sim (L/l_a)^{2\eta_a - 1} k^{-2\eta_a}$. Thus the expected
data term is

$$N D_{\rm KL}(P \| Q_{\rm cl}) \sim N \int_{\sqrt{NL/l}}^{\infty} dk \left( \frac{L}{l_a} \right)^{2\eta_a - 1} k^{-2\eta_a} \sim N^{3/2 - \eta_a} \left( \frac{L}{l_a} \right)^{2\eta_a - 1} \left( \frac{l}{L} \right)^{\eta_a - 1/2} , \tag{3.10}$$

and this is not necessarily small. For $\eta_a < 1.5$ it actually dominates the kinetic
term and competes with the Occam factor to establish the relevant smoothness
scale. Summarizing,

$$l^* \sim N^{1/3} , \qquad \eta_a > 1.5 , \tag{3.11}$$

$$l^* \sim N^{(\eta_a - 1)/\eta_a} , \qquad 0.5 < \eta_a < 1.5 . \tag{3.12}$$

Figure 3.5: Smoothness scale selection by the data. The lines that go off the axis
for small N symbolize that $H^*$ monotonically decreases as $l \to \infty$.



There are two noteworthy things about Eq. (3.12). First, for $\eta_a = \eta = 1$, $l^*$ sta-
bilizes at some constant value, which we expect to be equal to $l_a$. Second, even
if $\eta_a \neq \eta$, but $\eta_a < 1.5$, then Eqs. (3.10, 3.12) ensure that $\Lambda \sim N^{1/(2\eta_a) - 1}$, and this
asymptotic behavior will be reached almost immediately, unlike in the $\eta_a = 0, 0.6$
examples from Fig. (3.4). This performance is, at most, a constant factor away
from the limits set by heuristic calculations of predictive information, Eq. (2.99),
with the 'right' priors $\eta_a = \eta \neq 1$, a remarkable result!
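
The scalings quoted above follow from a short power-counting argument, sketched here under the assumption that the three terms of $H^*$ depend on $l$ and N only through the powers shown. The constants a, b, c are placeholders for N- and $l$-independent prefactors and are not quantities defined in the text.

```latex
\begin{align*}
H^*(l) &\sim a\, l \;+\; b\, N^{3/2-\eta_a}\, l^{\eta_a - 1/2} \;+\; c\, N^{1/2}\, l^{-1/2}
  && \text{(kinetic, data, fluctuation)} \\
a\, l \sim c\, N^{1/2} l^{-1/2}
  &\;\Rightarrow\; l^* \sim N^{1/3}
  && \text{(kinetic vs. fluctuation)} \\
b\, N^{3/2-\eta_a} l^{\eta_a-1/2} \sim c\, N^{1/2} l^{-1/2}
  &\;\Rightarrow\; l^{\eta_a} \sim N^{\eta_a - 1}
  \;\Rightarrow\; l^* \sim N^{(\eta_a-1)/\eta_a}
  && \text{(data vs. fluctuation)} \\
\left. b\, N^{3/2-\eta_a} l^{\eta_a-1/2} \right|_{l \sim N^{1/3}}
  &\sim N^{(4 - 2\eta_a)/3} \;>\; N^{1/3}
  \;\Leftrightarrow\; \eta_a < 3/2
  && \text{(data term dominates kinetic)} \\
\Lambda \sim D_{\rm KL} \sim N^{1/2-\eta_a}\, (l^*)^{\eta_a - 1/2}
  &= N^{-(\eta_a - 1/2)/\eta_a} = N^{1/(2\eta_a) - 1}
  && \text{(learning curve at } l^* \text{)}
\end{align*}
```

At $\eta_a = 1$ the exponent of $l^*$ vanishes and $\Lambda \sim N^{-1/2}$, and at $\eta_a = 3/2$ the two expressions for $l^*$ agree, so the two regimes join continuously.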

We present simulations relevant to these predictions in Figs. (3.5, 3.6). Unlike
in the previous Figures, these results are not averaged due to extreme computa-
tional costs, so all our further claims, which are inherently statistical, have to be
taken cautiously. On the other hand, selecting $l^*$ and observing the effects asso-
ciated with it in single runs has some practical advantages since then we are able
to ensure the best possible learning for any given realization of the data, not just
on average.

Figure (3.5) shows single learning runs for various $\eta_a$ and $l_a$. In addition, to
keep the Figure readable, we do not show runs with $\eta_a = 0.6, 0.7, 1.2, 1.5, 3$, and
$\eta_a \to \infty$, which is a finitely parameterizable distribution. All of these display a
good agreement with the predicted scalings, Eqs. (3.11, 3.12).



Figure (3.6) shows the KL divergence between the target distribution and its
classical estimate calculated at $l^*$; the average of this divergence over the samples
and the prior is the learning curve. Again, the predictions are clearly fulfilled.
Note that for all $\eta_a$ with the exception of $\eta_a = \eta = 1$ there is indeed a qualitative
advantage in using the data induced smoothness scale. To illustrate this more
clearly and ease the comparison we replotted some of the curves with adaptive $l$
side by side with their fixed $l$ analogues in Fig. (3.7).



3.6 Can the 'wrong' prior help? 

The last four Figures have illustrated some aspects of learning with 'wrong' pri- 
ors. However, more importantly, all of our results may be considered as belong- 
ing to the 'wrong prior' class. Indeed, the actual probability distributions we 
used were not nonparametric continuous functions with smoothness constraints, 
but were composed of kc Fourier modes and thus had exactly 2kc parameters. 
Usually it would take well over $2k_c$ samples to even start to close in on the
actual value of the parameters, and many more to get accurate results. How-
ever, using the wrong continuous parameterization $[\phi(x)]$ we were able to obtain
good fits for as low as 1000 samples [cf. Fig. (3.1)] with the help of the prior,
Eq. (A.5). Moreover, learning happened continuously and monotonically with-
out huge chaotic jumps of overfitting that necessarily accompany any brute force
parameter estimation method at low N. So, for some cases, a seemingly more com-
plex model is actually easier to learn!

Figure 3.6: Learning with the data induced smoothness scale.

We can summarize this by stating that, when data are scarce and the param- 
eters are abundant, one gains even by using the regularizing powers of wrong 
priors. The priors select some large scale features that are the most important to 
learn first and fill in the details as more data become available. If the global fea- 
tures are dominant (arguably, this is generic), one actually wins in the learning 
speed [cf. Figs. (3.2, 3.3, 3.?)]. If, however, small scale details are as important,
then one is at least guaranteed to avoid overfitting [cf. Fig. (3.4)].

Figure 3.7: Comparison of learning speeds for the same data sets with different
a priori assumptions. In all runs we learn using the model with $\eta = 1$, and $l$ is
either predefined, or set by the Occam factor. Curves shown: $\eta_a = 0.8$, $l_a = 0.1$, $l = l^*$;
$\eta_a = 0.8$, $l_a = 0.1$, $l = 0.2$; $\eta_a = 2$, $l_a = 0.1$, $l = 0.2$. Horizontal axis: N.

Thus we can argue that our numerical experiments support the Occam-like
claim we made in Chapter 2: if two models provide equally good fits to data,
the simpler one should always be used. In particular, using Eq. (2.102) we see that for
our simulations, the nonparametric QFT model is simpler (as characterized by the
predictive information) than a finite dimensional one for $N < N_c \sim (k_c \log k_c)^2$.
We operate in this regime in all of our simulations, and so we must learn faster
and with less overfitting if we use the wrong parameterization. Note that these
results are very much in the spirit of our whole program: not only is the value of
$l^*$ selected that simplifies the description of the data, but the continuous param-
eterization itself serves the same purpose. This is an unexpectedly neat general-
ization of the MDL (Rissanen 1989) principle to nonparametric cases.

3.7 Discussion

The field theoretic approach to density estimation in principle not only regular- 
izes the learning process but also allows the self-consistent selection of smooth- 
ness criteria through an infinite dimensional version of the Occam factors. We 
have shown numerically that this works, even more clearly than was conjectured. 
For $\eta_a < 1.5$, $Q_{\rm est}$ and the learning curve $\Lambda$ truly become properties of the data
and not of the Bayesian prior used for learning: one can set a learning machine
to work at $\eta = 1$ and be sure that this does not bias the estimates in any excessive
way. If we can extend these results to include $\eta_a > 1.5$ and combine this work
with the reparameterization invariant formulation of Periwal (1997, 1998), this
should give a complete theory of Bayesian learning for one dimensional distri-
butions, and this theory has no arbitrary parameters. In addition, if this theory
properly treats the limit $\eta_a \to \infty$, we should be able to see how the well-studied
finite dimensional Occam razors and the MDL principle arise from a more gen-
eral nonparametric formulation.

These results also have some biological implications. First of all, it may be 
that this smoothness scale adaptation mechanism is partly responsible for a com- 
monly known effect: children learn faster than adults. More seriously, and more 
closely connected to the models discussed here is the learning and development 
of smooth 'maps' in the nervous system (see, for example, Knudsen et al. 1987).
These maps become much less susceptible to the sensory inputs as time passes,
and this may be interpreted in terms of stiffening of the smoothness constraint.
Indeed, starting from scratch, it is very easy to drift the smoothness scale to such
large values that susceptibility of the learning machine (in other words, an ani-
mal) to the new data will be extremely small.

Second, if our conclusions are correct, then a learning machine that is pro- 
grammed to solve problems at r^ = 1 can easily solve more complex problems with 
any rja, 1.5 > rja > 0.5, by performing a simple averaging over the smoothness 
scales. At worst, this procedure may lead to a constant multiplicative drop in 
performance. By analogy, we may expect that, once an animal is able to treat a 
problem that falls in any power-law class with respect to the predictive informa- 
tion, then it is able to treat any problem that provides more predictive informa- 
tion with only a multiplicative overhead. Since, as we have already discussed 
(Hilberg 1990, Ebeling and Poschel 1994, Poschel, Ebeling, and Rose 1995), hu- 
mans can solve power-law problems, it is encouraging to know that there is no 
learning task in this world that is, in principle, too difficult for us (our lifetime is 
the only limiting factor). More seriously, if it is, indeed, possible to construct a 
complete theory of one (and, possibly, higher) dimensional learning, where both 
the smoothness scale and the exponent can be self-consistently determined, then 
the questions we asked in Section ^]6| (like "what models do humans use for learn- 
ing?") may lose their meaning — any model is almost as good as any other, and it 
is very difficult to look for possible multiplicative differences between them. Sur- 
prisingly, these questions are meaningful if such a theory cannot be constructed. 
In this case, as we saw in Eqs. ( p.ll[ . p.l2D and on Figs. ( P3| , P^ ), a complex learn- 
ing machine cannot effectively adapt to simple tasks. This again accords with 
our subjective experience that it is sometimes very hard to find a simple solution 
when expecting a complex one. It should be possible to devise an experiment 
that would quantify this failure to solve simple problems in complex contexts. 



Chapter 4 



Learning discrete variables: 
Information-theoretic regularization 



4.1 The general paradigm 

In Chapters 2 and 3 of this work we discussed what we consider to be some 
of the most interesting problems in modern statistics and learning theory. We 
defined predictive information and complexity, studied the learning of nonpara- 
metric distributions, and showed an example of how Occam factors generalize 
to infinite dimensional problems. There is one tantalizing question that followed 
us through this whole discussion: many problems require a prior to regularize 
learning, but then how can one make sure that the estimates and the conclusions 
are inferred from the data, rather than being imposed by some a priori knowledge 
that the data do not support? The results in the infinite dimensional generaliza- 
tion of the Occam factors seemed especially interesting in this respect — estimates 
can become almost insensitive even to the qualitative choice of prior, at least in 
a broad range. This conclusion, together with Periwal's (1997, 1998) work, are 
the only results known to us that aim towards constructing a reparameterization 
and prior invariant (or, better yet, ignorant) theory of nonparametric Bayesian 
inference. 

Even though we have concentrated on nonparametric problems, similar diffi- 
culties also exist in seemingly simpler, parametric cases. The choice of the prior 
for finite parameter learning scenarios is still a hot topic, and various alternatives 
are proposed that make some universal theoretical sense within the framework 
of information theory, MDL theory, etc. Examples include universal priors (Ris- 
sanen 1983, Li and Vitányi 1993) or Jeffreys' priors (Clarke and Barron 1994, 
Balasubramanian 1997). 

We think, however, that the prior really should embody a priori knowledge, 
and thus we cannot agree with the use of universal, globally definable choices. 
On the other hand, there is an obvious appeal for approaches based on more 
general theoretical principles. For example, the problems of nonuniform conver- 
gence of the estimate to the target, Eq. (A.11), and of the infrared divergence of 
predictive information, Eq. (2.97), are easily resolved if one turns to reparameter- 
ization invariant priors (Periwal 1997, 1998). These have a clear theoretical edge 
over non-invariant ones since they estimate a true density, that is, a function that 
transforms like a density under reparameterizations of the independent variable. 

As we tried to emphasize in this work, learning is information accrual, and, 
therefore, it is a part of Shannon's information theory. Thus a general theoretical 
principle can be to construct priors solely from information-theoretic quantities 
like entropy (or self-information), Kullback-Leibler divergence (or relative in- 
formation), various mutual informations, etc., and with no other constraints. In 
addition, since information-related quantities are formed from log-probabilities, 
it makes sense to include them exponentially into the priors, which are, after 
all, also probabilities. In Sections 4.3 and 4.4 we will present a simple example 
that illustrates the use of this general principle (regularization with information- 
theoretic quantities) and show that it is possible and advantageous to learn with 
such priors. 

4.2 Discrete variables: a need for special attention 

When learning probability densities of continuous variables, except for the very 
simple cases $N \gg d_{VC}$, priors are used to balance the quality of fits to the data 
against the complexity of solutions (cf. Section 2.4.6 and Chapter 3); this smooth- 
ing of data prevents overfitting. It is easy to construct smoothing priors for con- 
tinuous variables: continuity implies a metric, so locality is defined, and then 
one just needs to punish distributions that exhibit large variations over small dis- 
tances. Such a 'smoothness' prior will work as a regularizer. 

The case when a variable is discrete, and the (discrete) metric is impossible 
to define, presents a problem. Usually this case is not considered interesting, 
because the law of large numbers guarantees that, at large N, the frequencies 
of events estimate probabilities well. However, if the number of examples N 
is smaller than or comparable to the number of possible outcomes K, then statistical 
fluctuations are large. This situation is not as uncommon as one might hope. For 
example, it is possible that syntactic structures in a language can be inferred from 
statistical arguments alone (see, for example, Pereira et al. 1993, Manning and 
Schütze 1999). Estimating probabilities of occurrence of a few thousand common 
words is rather easy. It is even feasible to construct conditional distributions of 
nouns given verbs (Pereira et al. 1993). However, it is totally unrealistic to expect 
to build an accurate probability distribution of whole sentences from the raw data 
without some a priori knowledge. 

Similar problems appear in molecular biology. For example, gene expression 
is governed by promoter regions in DNA molecules. These regions are usually 
thought to be constructed from two five base pair long blocks (there are $4^5$ pos- 
sible different realizations of these) that appear anywhere inside 'junk' genetic 
material, which is about a hundred bases in length. If one tries to find the mean- 
ingful 5 + 5 structures by statistical methods (see, for example, van Helden et 
al. 1998), then a probability distribution over $4^5 \times 4^5 \approx 10^6$ possibilities must be 
constructed. Many experiments are done on yeast, and its genome is only a 
few million base pairs long. So getting full statistics is, at best, problematic. 
Even worse, if one tries to look for a possibility of promoters of a different length, 
then the problem becomes totally hopeless. 
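
The counting behind this example is easy to check. The lines below are only an illustrative sketch: the block length and alphabet size are taken from the text, while the genome length of three million base pairs is a hypothetical stand-in for "a few million".

```python
# Number of distinct 5 + 5 promoter structures versus available genome positions.
BASES = 4                    # A, C, G, T
BLOCK_LENGTH = 5             # length of each block, as in the text
GENOME_LENGTH = 3_000_000    # hypothetical value for "a few million" base pairs

motifs_per_block = BASES ** BLOCK_LENGTH     # realizations of one block
joint_motifs = motifs_per_block ** 2         # realizations of the 5 + 5 structure
positions_per_motif = GENOME_LENGTH / joint_motifs

print(motifs_per_block, joint_motifs, round(positions_per_motif, 2))
```

Under these assumptions there are only a few genome positions per possible structure, which is the undersampled regime N ~ K discussed above.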

To solve these and similar problems, one needs smoothing priors that regu- 
larize fluctuations. However, now there is no notion of locality to create them. 
It is not at all evident how to impose smoothness conditions, or whether these 
conditions will speed up learning in any way. It is clear though that any smooth- 
ing must be global — we cannot talk about local changes, but global variability 
is well defined. Recall that variability can be measured non-metrically by en- 
tropies, and these have a very special meaning in information theory — they are 
the unique measure of information (Shannon 1948). Therefore, learning a dis- 
crete variable may turn out to be an excellent example of the general paradigm 
introduced above. 

In the next sections we present a toy model for learning a discrete variable 
and show that it is possible to regularize and speed up such learning with the 
help of information-theoretic priors. In general, discrete calculus is much less 
developed than its continuous analogue (precisely for the reason that the notion 
of locality is absent), so analytical results can usually be achieved only for very 
simple problems, and our example is like that. Nonetheless, this case is of interest 
as it solves some real world problems and, more importantly, because it develops 
techniques that later will be used in more complicated tasks; the work on those is 
already in progress. 

4.3 Toy model: theory 

Consider the following 'real world' problem. The US Census Bureau needs a pre- 
liminary report on statistics of people in Trenton, NJ based on the Census-2000 
data. Unfortunately, only a few thousand households have filled in their Cen- 
sus cards, and this is clearly not enough to sample adequately the many classes 
x, x = 1 ... K (we also call them possible outcomes or bins), into which the peo- 
ple are classified (ethnicity, marital status, educational level, size of the house- 
hold, etc.). Suppose now that Newark, NJ, perceived to have similar population 
statistics, has been quick to organize door-to-door counting of people, so that 
a good sampling of Newark's population, Q*(x), is available. Can the Census 
Bureau statisticians use these data to answer their questions about the Trenton 
people? An obvious solution would be to take a weighted average of the (un- 
dersampled) Trenton counts and the (well sampled) Newark probabilities. But 
how should the weights be set in order to ensure that the Trenton estimate is just 
smoothed and not unfairly biased by the Newark data? 

We can offer a solution to the problem by choosing an a priori probability 
density for Q(x), the Trenton distribution, that is biased towards the reference 
Newark distribution Q*(x). This may be done in the following form 

$$\mathcal{P}[Q(x), \lambda] = \frac{1}{Z_Q(\lambda)} \exp\left[-\lambda D(Q^*\|Q)\right] \delta\left(\sum_x Q(x) - 1\right) \mathcal{P}(\lambda), \tag{4.1}$$

where $\mathcal{P}(\lambda)$ is some a priori normalized density for the inverse temperature-like 
parameter $\lambda$, $Z_Q(\lambda) = \int [dQ(x)]\, \mathcal{P}[Q(x)|\lambda]$ is the normalization of the a priori dis- 
tribution of Q(x) conditional on $\lambda$, and D is some measure of distance between 
the two distributions, Q* and Q. If we want to stick to our paradigm of using 
information-theoretic quantities only, then we do not have much freedom in se- 
lecting D since the natural distance between any two distributions in information 
theory is the familiar Kullback-Leibler divergence, 

$$D(Q^*\|Q) \equiv D_{KL}(Q^*\|Q) = \sum_{x=1}^{K} Q^*(x) \log \frac{Q^*(x)}{Q(x)}. \tag{4.2}$$

In the language of coding theory, this choice of D means that we optimize our 
coding strategies for Q, but we want Q's such that the data coming from the refer- 
ence distribution can be coded compactly also. We could have chosen to measure 
distances in the opposite direction and switch arguments in the KL divergence. 
Then the best coding for Q* is fixed, and we look for Q that is still coded well. We 
chose Eq. (4.2) over the other choice because this allows an exact solution. 
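
The asymmetry between the two directions of the KL divergence can be seen on a small example. This is a minimal sketch only; the two three-bin distributions are hypothetical and the function name is ours.

```python
import math

def kl_divergence(p, q):
    """Return D_KL(p||q) = sum_x p(x) log[p(x)/q(x)] in nats."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x > 0.0:
            if q_x == 0.0:
                return math.inf       # p puts weight where q has none
            total += p_x * math.log(p_x / q_x)
    return total

q_ref = [0.5, 0.3, 0.2]   # hypothetical reference distribution Q*
q_try = [0.6, 0.3, 0.1]   # hypothetical candidate distribution Q

forward = kl_divergence(q_ref, q_try)   # D_KL(Q*||Q), the direction of Eq. (4.2)
reverse = kl_divergence(q_try, q_ref)   # D_KL(Q||Q*), the opposite direction
print(forward, reverse)
```

Both numbers are positive but they differ, so the choice of direction is a real modeling decision and not a matter of convention.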

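Now, similarly to Bialek et al. (1996, see also Appendix A.1), we apply the 
Bayes formula to get the probability density for Q(x) and $\lambda$ given the data $\{x_i\}$: 

$$P[Q(x), \lambda | \{x_i\}] = \frac{P[\{x_i\} | Q(x), \lambda]\, \mathcal{P}[Q(x), \lambda]}{P(\{x_i\})} = \frac{\mathcal{P}[Q(x), \lambda] \prod_{i=1}^{N} Q(x_i)}{\int [dQ(x)]\, d\lambda\, \mathcal{P}[Q(x), \lambda] \prod_{i=1}^{N} Q(x_i)}. \tag{4.3}$$

The least square estimator of Q(x) is then, as usual, 

$$Q_{est}(x | \{x_i\}) = \int d\lambda\, [dQ(x)]\, Q(x)\, P[Q(x), \lambda | \{x_i\}] = \frac{\langle Q(x) Q(x_1) Q(x_2) \cdots Q(x_N) \rangle^{(Q,\lambda)}}{\langle Q(x_1) Q(x_2) \cdots Q(x_N) \rangle^{(Q,\lambda)}}, \tag{4.4}$$

where $\langle \cdots \rangle^{(Q,\lambda)}$ stands for averaging over Q and $\lambda$ with respect to the prior; simi- 
larly, $\langle \cdots \rangle^{(Q)}$ means averaging only over $\mathcal{P}[Q|\lambda]$. 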

If $\lambda$ were fixed, then the averaging in Eq. (4.4) would have one integral less, a 
simpler problem. However, varying it may allow the Occam factor that arises 
from volumes in the Q-space to find some $\lambda^*$ that creates the shortest (thus the 
most probable) description of the data (cf . Section P75D . By achieving the optimal 
balance between the 'goodness of fit' and closeness to the reference, this may 
resolve the problem of a possible erroneous bias towards Q*. 

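As mentioned above, the solution of this model is rather simple. We write n(x) 
for the data count in the bin x, $\sum_x n(x) = N$. Then, leaving aside the $\lambda$ integral 
for a while, we have (see Appendix A.2): 

$$\langle Q(x_1) \cdots Q(x_N) \rangle^{(Q)} = \int [dQ]\, \frac{e^{-\lambda D_{KL}(Q^*\|Q)}}{Z_Q(\lambda)}\, \delta\left(\sum_x Q(x) - 1\right) \prod_x Q(x)^{n(x)} \tag{4.5}$$

$$= \frac{e^{\lambda S[Q^*]}}{Z_Q(\lambda)} \int [dQ]\, \delta\left(\sum_x Q(x) - 1\right) \prod_x Q(x)^{n(x) + \lambda Q^*(x)} \tag{4.6}$$

$$= \frac{e^{\lambda S[Q^*]}}{Z_Q(\lambda)}\, \frac{\prod_{x=1}^{K} \Gamma(n(x) + \lambda Q^*(x) + 1)}{\Gamma(\lambda + N + K)}, \tag{4.7}$$

where S[Q*] is just the entropy of the reference distribution. $Z_Q(\lambda)$ is given by the 
same integral, Eqs. (4.5, 4.7), but with n(x) = N = 0. Therefore, if we now integrate 
over $\lambda$ using the steepest descent technique, then the most probable value of the 
inverse temperature is determined by 

$$\sum_{i=1}^{N} \frac{1}{\lambda + K + i - 1} - \sum_{x=1}^{K} \sum_{j=1}^{n(x)} \frac{Q^*(x)}{\lambda Q^*(x) + j} = 0. \tag{4.8}$$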

Numerically (cf. Section 4.4) this equation has one nontrivial solution for large 
N, so that the Occam factor seems to work again. Unfortunately, however, we 
were unable to obtain satisfying analytical results with the exception of some trivial 
asymptotic limits. If, on the other hand, we keep $\lambda$ fixed, then [see Eq. (4.7)] 

$$Q_{est}(x) = \frac{n(x) + \lambda Q^*(x) + 1}{N + \lambda + K}. \tag{4.9}$$

The simplicity of this result is intriguing. Our initial suggestion to average the 
actual undersampled data with the well sampled smoothing distribution turns 
out to have deep roots in Bayesian inference with the prior Eq. (4.1)! 
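
The fixed-$\lambda$ estimate is simple enough to state as code. The sketch below assumes the form (n(x) + $\lambda$Q*(x) + 1)/(N + $\lambda$ + K) of Eqs. (4.9, 4.15) and a normalized reference; the function name and the example numbers are hypothetical.

```python
def smoothed_estimate(counts, reference, lam):
    """Fixed-lambda estimate (n(x) + lam * Q*(x) + 1) / (N + lam + K).

    counts    -- list of n(x), one entry per bin
    reference -- list of Q*(x), assumed to sum to one
    lam       -- inverse temperature-like weight of the reference
    """
    if len(counts) != len(reference):
        raise ValueError("counts and reference must have the same length")
    norm = sum(counts) + lam + len(counts)          # N + lam + K
    return [(n + lam * q + 1.0) / norm for n, q in zip(counts, reference)]

reference = [0.4, 0.3, 0.2, 0.1]     # hypothetical well sampled distribution
counts = [0, 0, 0, 3]                # hypothetical undersampled counts

for lam in (0.0, 10.0, 1000.0):
    print(lam, smoothed_estimate(counts, reference, lam))
```

For small lam the counts dominate (with one extra count in every bin), while for large lam the estimate approaches the reference regardless of the data.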

Note the presence of '+1' in Eqs. (4.7, 4.9). Due to this term the estimated 
distribution is a smoothed version of the counts n(x) even if $\lambda \to 0$. In this the- 
ory, no bin has an estimated probability of zero, so that $D_{KL}(Q^*\|Q)$ always is 
well defined, and observing the next sample in any bin never is a completely 
unexpected event. This summand, which has so many desirable consequences, 
appears because we define the prior, Eq. (4.1), on the space of Q's. Changing the 
variables from probabilities to log-likelihoods, $-\phi(x) = \log Q(x)$, creates a Jacobian 
which effectively adds one count in every bin. Maximum likelihood estimation of 
parameters does not have this feature, and this is yet another argument in favor 
of the Bayesian formulation. 
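
The origin of the '+1' can be traced in a few lines. What follows is only a sketch of that check, assuming the standard Dirichlet integral and fixed $\lambda$; the shorthand $a(x) = n(x) + \lambda Q^*(x)$, with $\sum_x a(x) = N + \lambda$, is introduced here for convenience and is not the notation of the text.

```latex
\begin{align*}
% Jacobian of Q = e^{-\phi}: one extra power of Q per bin
Q(x)^{a(x)}\, dQ(x) &= e^{-[a(x) + 1]\phi(x)}\, d\phi(x)
  \quad \text{(up to a sign)}, \\
% Dirichlet integral over the simplex
\int [dQ]\, \delta\Big(\sum_x Q(x) - 1\Big) \prod_x Q(x)^{a(x)}
  &= \frac{\prod_x \Gamma(a(x) + 1)}{\Gamma\big(\sum_x a(x) + K\big)}, \\
% ratio of two such integrals, using \Gamma(z + 1) = z\,\Gamma(z)
Q_{est}(x') &= \frac{\Gamma(a(x') + 2)}{\Gamma(a(x') + 1)}\,
               \frac{\Gamma(\lambda + N + K)}{\Gamma(\lambda + N + K + 1)}
             = \frac{n(x') + \lambda Q^*(x') + 1}{N + \lambda + K}.
\end{align*}
```

The extra unit in the numerator and the extra K in the denominator both come from the "+1" in the arguments of the gamma functions, that is, from integrating over Q and not over $\phi$.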

Even though this toy model has an exact solution, it still is instructive to per- 
form a saddle point (large N) analysis in the hope that some useful knowledge 
reusable in more complex settings will be gained. With the usual change of vari- 
ables, 

$$Q(x) = e^{-\phi(x)}, \qquad Q^*(x) = e^{-\phi^*(x)}, \tag{4.10}$$

leaving the $\lambda$ integral aside again, and replacing the $\delta$-function by its Fourier 
representation, we get the following expression for the correlation functions: 

$$\langle Q(x_1) \cdots Q(x_N) \rangle^{(Q)} = \int \frac{[d\phi(x)]}{Z_Q(\lambda)} \frac{d\mu}{2\pi}\, e^{-H[\phi(x), i\mu, \lambda, \phi^*(x)] - \sum_x [n(x)+1] \phi(x)}, \tag{4.11}$$

$$H[\phi(x), i\mu, \lambda, \phi^*(x)] = \lambda \sum_x Q^*(x) [\phi(x) - \phi^*(x)] + i\mu \left( \sum_x e^{-\phi(x)} - 1 \right). \tag{4.12}$$

Calculating up to one-loop corrections using the steepest descent techniques, this 
Hamiltonian results in 

$$\langle Q(x_1) \cdots Q(x_N) \rangle^{(Q)} = e^{-H_{eff}[\phi_{cl}(x), \lambda, \phi^*(x)] - \sum_x [n(x)+1] \phi_{cl}(x)}, \tag{4.13}$$

$$H_{eff}[\phi_{cl}(x), \lambda, \phi^*(x)] = H[\phi_{cl}(x), i\mu = N + \lambda + K, \lambda, \phi^*(x)] + \frac{K}{2} \log\left(1 + \frac{N}{\lambda}\right) + \frac{1}{2} \sum_x \left(\phi^*(x) - \phi_{cl}(x)\right), \tag{4.14}$$

$$Q_{cl}(x) \equiv e^{-\phi_{cl}(x)} = \frac{n(x) + \lambda Q^*(x) + 1}{N + \lambda + K}. \tag{4.15}$$

Again, just like in Eqs. < ^1[ |A.7D , the fluctuation determinant [the last two terms 
in Eq. (4.14)] has a different $\lambda$ dependence than the data and the prior terms. 
This suggests, even more clearly than the exact result Eq. (4.7), that competition 
between them will select a nontrivial most probable value $\lambda^*$. 

Note that $Q_{est}$, Eq. (4.9), is equal to $Q_{cl}$. A priori one should not expect them to 
be similar even for large N. Indeed, the matrix of second derivatives in the saddle 
point calculation has eigenvalues $\sim n(x)$, and these are small for bins with small 
counts. So, in principle, one could expect large discrepancies between the exact 
and the classical solutions. The fact that they are the same inspires a hope that the 
saddle point analysis may remain useful for other, more complex, problems. 

Finally, we want to illustrate yet another important aspect of this toy model. 
What is the predictive information for this system? In general, it is difficult to 
calculate, so we consider two very simple limits: $\lambda \ll N$, $K \ll N$, and 
$1 \ll N \ll \lambda$, $K \ll \lambda$. The first case closely parallels calculations of Section 2.4.2 and yields 

$$S_1(N) \approx \frac{K}{2} \log N. \tag{4.16}$$

In the second case, somewhat lengthy calculations lead to 

$$S(N) = N S[Q^*] + S_1(N), \qquad \lim_{N \to \infty,\, N/\lambda \to 0} S_1 = 0, \tag{4.17}$$

where S[Q*] is again the entropy of the reference distribution. These results are 
expected: for small $\lambda$ and large N this problem is just learning a K-parameter 
distribution, so Eq. ( |2.5(J| ) should hold. On the other hand, for extremely large $\lambda$, 
the estimate converges to the reference distribution regardless of the data, so the 
problem is effectively zero dimensional. 

If we do estimates at some large but fixed $\lambda$, then we should see a crossover 
from the zero to the K parameter regime. [1] For generic distributions the crossover 
will be smooth, since each bin starts to add its (1/2) log N to the predictive in- 
formation independently when the estimate for that bin switches from the ref- 
erence value to the count, i.e., when $n(x) > \lambda Q^*(x)$ [cf. Eq. (4.9)]. A smooth 
crossover from zero to the logarithm is possible only through a faster than loga- 
rithmic growth for some range of N. Indeed, from the discussion in Section 2.4.6, 
we know that continuous addition of extra degrees of freedom is a sign of the 
power-law growth of predictive information. It is remarkable that a discrete sys- 
tem as simple as this toy model can exhibit such rich behavior that, so far, has 
been associated with only very complex nonparametric models; we will see a 
numerical demonstration of this in the next section. 

[1] If we average over $\lambda$, and $\lambda^*(N)/N$ starts large and drifts to below 1 as N increases, then we 
will still observe the crossover. This behavior of $\lambda^*$ is possible since at low N most of the weight 
should go to the smoothing term, while at large N the actual counts are to be trusted. 

4.4 Toy model: numerical analysis 

If we want to be able to observe all of the features described in the previous sec- 
tion, then K, the number of bins, should be large enough to allow for a prominent 
prior-dominated (scarce data) region, but it also should be small enough so that 
we can generate enough samples and observe a cross-over to the data-driven 
regime in all bins. The choice of K = 75, 100, 125 reasonably satisfies both condi- 
tions and will be used in all of our simulations. 

The next important question is the generation of random distributions from 
the prior, Eq. (4.1). Due to the $\delta$-function normalization constraint this is a com- 
plicated task. However, in the limit of large $\lambda$, $D_{KL}[Q^*\|Q]$ typically is small and 
can be approximated by the $\chi^2$ distance 

$$D_{KL}[Q^*\|Q] \approx \frac{1}{2} \sum_{x=1}^{K} \frac{[Q(x) - Q^*(x)]^2}{Q^*(x)}. \tag{4.18}$$

Then the prior, Eq. (4.1), becomes a multi-variable normal density, and genera- 
tion of random distributions is easy. [2] 
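
One possible way to draw such random distributions is sketched below. It is an assumption of ours, not a description of the procedure actually used for the simulations: each bin fluctuates around Q*(x) with variance Q*(x)/$\lambda$ (the Gaussian that follows from the quadratic approximation of the KL divergence), the fluctuations are conditioned to sum to zero, and draws with non-positive entries are rejected. The function name and the parameter max_tries are hypothetical.

```python
import math
import random

def sample_from_gaussian_prior(reference, lam, rng, max_tries=1000):
    """Draw Q near the reference Q* under the Gaussian approximation.

    reference -- list of Q*(x), assumed positive and summing to one
    lam       -- inverse temperature; larger values keep Q closer to Q*
    rng       -- a random.Random instance
    """
    for _ in range(max_tries):
        # independent fluctuations with variance Q*(x) / lam
        z = [rng.gauss(0.0, math.sqrt(q_ref / lam)) for q_ref in reference]
        z_sum = sum(z)
        # condition the Gaussian on sum_x [Q(x) - Q*(x)] = 0
        q = [q_ref + z_x - q_ref * z_sum for q_ref, z_x in zip(reference, z)]
        if min(q) > 0.0:              # keep only valid probability vectors
            return q
    raise RuntimeError("no positive sample found; lam may be too small")

rng = random.Random(0)                # fixed seed, for reproducibility only
reference = [1.0 / 100] * 100         # hypothetical uniform reference, K = 100
q_random = sample_from_gaussian_prior(reference, 500.0, rng)
print(min(q_random), max(q_random), sum(q_random))
```

The subtraction of Q*(x) times the total fluctuation is the usual conditioning of a Gaussian with diagonal covariance on a linear constraint, so the result is normalized by construction.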

The choice of $\lambda$ is motivated by the same arguments as the choice of K. The 
asymptotic regime is reached for $\lambda K > N$, while simulations are time limited 
by N ~ 10^. Therefore, we must use $\lambda$ of up to about 1000. On the other hand, 
we want to see as much of the prior-enforced learning as possible, so we choose 
to work close to this limit: $\lambda$ = 300, 500, 1000. One may be concerned that these 
large values will produce random distributions that are almost identical to the 
reference, which would make some of our results less interesting. Fig. (4.1) shows 

[2] The KL divergence tends to the $\chi^2$ measure from below. Thus replacing $D_{KL}$ with $\chi^2$ pro- 
duces a slightly narrower prior; this difference becomes smaller as $\lambda$ grows. In principle, this can 
produce dramatic changes in statistics, but for our choices of independent parameters this turns 
out to be almost irrelevant. 



[Plot: Q versus Q*, both axes running from about 0.005 to 0.035; dots mark the data and the dashed line marks Q = Q*.] 

Figure 4.1: Comparison between the reference distribution and a typical random 
one generated with $\lambda = 500$ and K = 100. Each point corresponds to the values 
(Q*(x), Q(x)) in one of the bins x. 

that these fears are misplaced. The reference and the random distributions are 
similar (as we want them to be), but not excessively so. 



4.4.1 Learning with the correct prior 

Fig. (4.2) shows the N dependence of the universal learning curve $\Lambda$ [averaged 
KL divergence, Eqs. ( |2.13| , p3|)], for various combinations of $\lambda$ and K. All of the 
behavior predicted in Section 4.3 is observed clearly. The learning curves start 
out flat (predictive information and the effective number of parameters are zero) 
and soon enter a continuous series of transitions that add more and more degrees 
of freedom. Finally the behavior enforced by Eq. (2.62), 

$$\Lambda(N) \approx \frac{K}{2N}, \tag{4.19}$$

is reached. Notice that, in agreement with the claim that the number of active 
parameters depends on the comparison between the counts and the value of $\lambda$, 
the asymptotic regime is reached at earlier N for smaller values of the parame- 
ter. The agreement with this asymptotic behavior is so remarkably good that we 
chose not to quote any fits: all of them are within the expected errors. 

Figure 4.2: $\Lambda$ as a function of $\lambda$, K, and N for $\lambda = \lambda_a$. 

In the pre-asymptotic regime, learning curves with larger $\lambda$ start lower since 
there the random distributions are very close to the reference and are estimated 
much better by it. Similarly, curves with smaller K also start lower because now 
there are fewer degrees of freedom and, therefore, fewer ways to get a larger KL 
divergence. 
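
The qualitative shape of such a learning curve can be reproduced with a very small simulation. The sketch below is not the simulation reported here: it uses a single hypothetical target distribution (not an ensemble drawn from the prior), a uniform reference, K = 20 bins, and the fixed-$\lambda$ estimate (n(x) + $\lambda$Q*(x) + 1)/(N + $\lambda$ + K); all names and numbers are illustrative assumptions.

```python
import math
import random

def kl_divergence(p, q):
    """D_KL(p||q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)

def smoothed_estimate(counts, reference, lam):
    """Fixed-lambda estimate (n(x) + lam * Q*(x) + 1) / (N + lam + K)."""
    norm = sum(counts) + lam + len(counts)
    return [(n + lam * q + 1.0) / norm for n, q in zip(counts, reference)]

def learning_curve_point(target, reference, lam, n_samples, n_trials, rng):
    """Average D_KL(target||estimate) over n_trials independent data sets."""
    bins = range(len(target))
    total = 0.0
    for _ in range(n_trials):
        counts = [0] * len(target)
        for x in rng.choices(bins, weights=target, k=n_samples):
            counts[x] += 1
        total += kl_divergence(target, smoothed_estimate(counts, reference, lam))
    return total / n_trials

rng = random.Random(0)                      # fixed seed, for reproducibility only
K = 20                                      # hypothetical number of bins
weights = [rng.random() + 0.5 for _ in range(K)]
target = [w / sum(weights) for w in weights]
reference = [1.0 / K] * K                   # hypothetical uniform reference Q*
lam = 50.0                                  # hypothetical inverse temperature

curve = {n: learning_curve_point(target, reference, lam, n, 50, rng)
         for n in (10, 100, 1000, 10000)}
for n, value in curve.items():
    print(n, value, K / (2 * n))            # simulated curve versus K / (2N)
```

In this setup the curve is nearly flat while N is below or comparable to lam and then bends over towards K/(2N), which is the behavior described in the text.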



4.4.2 Learning with 'wrong' priors 

As we did for nonparametric distributions in Section p7^ we now want to in- 
vestigate the performance of the learning algorithm on atypical data sequences. 
That is, as before, we will differentiate two sets of parameters: $\lambda$ and Q*, which 
encode the expectations of the learning machine, and $\lambda_a$ and $Q^*_a$, which together 
with Eq. (4.1) describe the ensemble of atypical target distributions. Simulations 
related to this question are summarized in Fig. (4.3). Here we have shown learn- 
ing with different combinations of $\lambda \neq \lambda_a$, and the case $\lambda = \lambda_a = 500$, $Q^* = Q^*_a$ is 
plotted as a reference. Comparing the curves with the corresponding ones from 
the previous Figure, we clearly see that, even though learning with an 'incorrect' 
prior is possible, there are discrepancies: convergence to the asymptotic limit is 
different, and the curves start slightly higher than their 'correct' counterparts. 

Figure 4.3: $\Lambda$ as a function of $\lambda$, $\lambda_a$, $Q^*_a$ and N. For all curves K = 100. 

Another interesting example shown is when $\lambda$ is 'correct', but the reference 
distribution itself is totally wrong. We see that this type of mistake is much more 
costly: even for A^ as large as 10^ the influence of the wrong reference distribution 
is still strong enough to compromise fast learning. 

Note the curve with $\lambda_a = 500$, $\lambda = 0$. This is a case of 'trivial' learning, when 
the only regularization is due to the Jacobian of the $Q \to \phi$ transformation [the 
'+1' term in Eq. (4.9)]. If not for it, the estimate would be extremely overfitted 
and would have zeros, and $\Lambda$ would explode. On the other hand, with the 
'+1' correction, $\Lambda$ starts out from a constant value, which is the KL divergence 
between the target and the uniform distribution. 

4.4.3 Selecting $\lambda$ with the help of the data 

Having shown how the learning machine performs on expected and unexpected 
data samples, we are now in a position to ask if the Occam factors can select 
the right regularization parameter $\lambda$ as in the case of finite dimensional models 
(MacKay 1992, Balasubramanian 1997) and nonparametric models (Bialek, Cal- 
lan, and Strong 1996, and Section ^ of the present work). This question has 
an affirmative answer, and Fig. (4.4) shows $\lambda^*$, the value of $\lambda$ that maximizes 
the correlation function Eq. (4.7) and minimizes the exponent in Eq. (4.13). We 
show the results averaged over many runs, even though this form of presentation 
is questionable because $\lambda^*$ fluctuates a lot in different trials. These fluctuations 
explain the kinks on the three lower curves. For N to the left of the kinks, there 
are many realizations of the data for which there is no best value for $\lambda$, and 
the correlation functions are maximized at $\lambda^* \to \infty$. The kinks appear due to 
the numerically imposed finite cutoff on possible values of the parameter. Apart 
from this, the rest of the learning curves' behavior is as hoped. For ensembles 
generated at $\lambda_a = 500$ and studied at $Q^* = Q^*_a$, $\lambda^*$ turns out to be very close to 
500. If $Q^* \neq Q^*_a$, $\lambda^*$ drifts to smaller values, letting the data, not the reference, 
control the estimate. The same happens for $\lambda_a = 0$ for any Q* since at this value 
of the parameter the target is, again, far from the reference. 

Figure 4.4: $\lambda^*$ for various ensembles of target distributions. 
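
The selection of $\lambda^*$ can be illustrated with the exact correlation function. The sketch below assumes the gamma-function form of Eq. (4.7), with $Z_Q(\lambda)$ obtained from the same expression at n(x) = N = 0 so that the entropy factors cancel, and a flat a priori density for $\lambda$; the grid, its cutoff at 1000, and the two data sets are hypothetical.

```python
import math

def log_evidence(counts, reference, lam):
    """Logarithm of the correlation function, Eq. (4.7), divided by Z_Q(lam)."""
    n_total = sum(counts)
    k_bins = len(counts)
    value = math.lgamma(lam + k_bins) - math.lgamma(lam + n_total + k_bins)
    for n, q in zip(counts, reference):
        value += math.lgamma(n + lam * q + 1.0) - math.lgamma(lam * q + 1.0)
    return value

def select_lambda(counts, reference, grid):
    """Return the grid value of lam with the largest log evidence."""
    return max(grid, key=lambda lam: log_evidence(counts, reference, lam))

reference = [0.25, 0.25, 0.25, 0.25]          # hypothetical reference Q*
grid = [0.0, 1.0, 3.0, 10.0, 30.0, 100.0, 300.0, 1000.0]   # cutoff at 1000

lam_matched = select_lambda([25, 25, 25, 25], reference, grid)    # data agree with Q*
lam_mismatched = select_lambda([97, 1, 1, 1], reference, grid)    # data disagree with Q*
print(lam_matched, lam_mismatched)
```

When the counts follow the reference, the search is pushed towards the upper cutoff, and when they contradict it, small values are preferred; this mirrors the drift of $\lambda^*$ described above.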

Finally, we examine the case where the ensemble of the distributions is qual- 
itatively closer to the reference than the prior, Eq. (4.1), allows (this corresponds 
to $\eta < \eta_a$ for nonparametric learning). This can be achieved by having a higher 
power of the KL divergence in the exponent of the prior, but such a choice is very 
difficult for numerical simulations. So we take an easier, but less illustrative ex- 
ample of $\lambda_a = \infty$; that is, the target distribution is exactly equal to the reference. 
Here, not surprisingly, $\lambda^*$ quickly becomes very large. Again, it often exceeds the 
numerical cutoff, so the average line shown should not be taken too literally. 

Finishing up this section, in Fig. (4.5) we show the learning curves calculated 
at $\lambda = \lambda^*$ for all the cases considered above. Comparing to the corresponding 
learning curves in Fig. (4.3), we deduce that learning with an adaptive $\lambda$ is much 
faster than with a fixed wrong one, and the example with $\lambda_a = \infty$ is particularly 
demonstrative. [3] The only learning curve which starts off slightly worse than its 
fixed $\lambda$ analogue is that for $\lambda_a = 500$, $Q^* = Q^*_a$. Even though it improves very 
quickly, this once again proves the common knowledge that for small sample 
sizes nothing beats learning with the 'correct' prior. 

Figure 4.5: Learning with an adaptive $\lambda^*$. 

[3] This remarkable performance for $\lambda_a = \infty$ is achieved with the upper cutoff on $\lambda^*$, and it is 
possible that $\Lambda$ can fall off even faster without the cutoff. 

4.5 Further work 

The toy example we have investigated resolves the questions it was meant to 
answer. Yet it is just a toy example, and most of its value lies in the extension to 
more difficult problems. 

The most straightforward, but very interesting development is one that has 
been already mentioned in passing. Reversing the direction of the KL divergence 
in Eq. (4.1) and choosing a uniform reference distribution Q*, we obtain a prior 
that favors distributions with larger entropies, i.e., the flattest and the most regu- 
lar distributions. Finding the least variable distribution compatible with the data 
is certainly in the spirit of our work and deserves an investigation. 

This idea may be made more sophisticated when the independent variable is 
a vector, $\vec{x} = \{x, y\}$. If the distribution $Q(\vec{x})$ is expected to be smooth, but the 
cardinalities of x and y ($K_x$ and $K_y$ respectively) are large, then only the marginal 
distributions $Q(x) = \sum_y Q(x, y)$ and $Q(y) = \sum_x Q(x, y)$ are sampled well for $K_x$ 
and $K_y \ll N \lesssim K_x K_y$. To smooth the data we might want to weight the a priori 
probability of Q(x, y) by the entropy S[Q(x, y)]. However, this choice, though 
valid, is not the best. Indeed, the entropy can be small because the marginals are 
very narrow. But in the limit we are interested in, the marginals are well defined 
by the data and do not require separate smoothing. So it is not the entropy of a 
distribution, but its value with respect to the entropy of the marginals that should 
enter a regularizing prior. This is the mutual information I(x, y) between x and 
y, and it is, once again, a meaningful information-theoretic quantity. 
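
The quantity proposed here is the entropy of the marginals minus the entropy of the joint distribution. A minimal sketch of its evaluation for a joint table follows; the function names and the two 2 x 2 examples are hypothetical.

```python
import math

def entropy(probabilities):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

def mutual_information(joint):
    """I(x, y) = S[Q(x)] + S[Q(y)] - S[Q(x, y)] for a joint table given as rows."""
    marginal_x = [sum(row) for row in joint]
    marginal_y = [sum(column) for column in zip(*joint)]
    flat = [p for row in joint for p in row]
    return entropy(marginal_x) + entropy(marginal_y) - entropy(flat)

independent = [[0.12, 0.28], [0.18, 0.42]]   # product of (0.4, 0.6) and (0.3, 0.7)
correlated = [[0.5, 0.0], [0.0, 0.5]]        # x determines y completely

print(mutual_information(independent), mutual_information(correlated))
```

Narrow marginals lower the joint entropy but leave this difference unchanged, which is why the mutual information, and not the entropy itself, isolates the part of the variability that still needs smoothing.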

A more ambitious but very appealing direction is to combine these methods 
with the relevant information extraction ideas of Tishby et al. (1999) and Bialek 
and Tishby (in preparation), discussed briefly in Section ^. Recall that these 
authors proposed to compress (that is, to smooth) the variable x into $\tilde{x}$ so that the 
mutual information $I(\tilde{x}, y)$ remains as close to I(x, y) as possible. There is a one 
parameter family of solutions to the problem, and this parameter measures the 
relative importance of compression (smoothing) and preservation of the informa- 
tion (fit to the data). In many practical applications, choosing the right value of 
this parameter is a problem. We can view the result of Tishby et al. as a classical 
solution to a problem with some (yet to be defined) prior. One can realistically 
expect that the value of the parameter will, once again, be set by the Occam factor. 

If this theory succeeds, we can use these results to further advance the the- 
ory of learning a nonparametric variable. Derivatives of densities do not have 
any special meaning in the framework of information theory. Using them, as in 
Eq. (A.5) or its reparameterization invariant version (Periwal 1997, 1998), is thus 
not a preferred regularization. Building priors that include terms with deriva- 
tives of many different orders and fixing coefficients of the terms by requiring 
that the estimate does not depend on the UV details of the data (rounding or 
truncation) [4] may work and have a meaning in QFT, but this is an approach alien 
to information theory. Similarly, preferring distributions that are close to their 
filtered version and averaging over the filters afterwards [5] has the same prob- 
lem. What we mean by smoothness in any formulation, including nonparametric 
continuous ones, is that the independent, possibly continuous, variable x can be 
successfully coded in some $\tilde{x}$ of finite and small cardinality such that this com- 
pacted version explains the data almost as well as the original one. For example, 
in the finite parameter case, $\tilde{x} = \alpha$. Developing a theory for smoothing through 
compression would be a great achievement in itself, and within this formalism 
we get the added advantage that the right balance between the goodness of fit 
and the compression will again be determined by the Occam factor. 

This is obviously only a start to an extensive program of generalizations, and 
we hope to make significant progress along these lines in the near future. 

[4] This Wilsonian renormalization group approach was suggested by V. Periwal. 
[5] This idea is by W. Bialek. 



Chapter 5 



Conclusion: what have we achieved? 



Let us summarize our achievements and compare them to the promises we made 
and the desires we expressed in the beginning of this work. As promised, we 
built a uniform and universally valid approach to learning by using information 
theory and treating learning as an ability to predict. For this we defined a new 
quantity, the predictive information, and the study of it revealed that it not only 
measures the information relevant to prediction of a time series, but also defines 
uniquely the complexity of the process that generated the series. Statistical mechanics
gave insight into how learning is always annealing in the model space,
and then we could illuminate numerous connections to statistical learning and 
coding theories and catch some omissions in those. Summarizing, we indeed 
delivered on the promise of a coherent re-treatment of the old knowledge. 

Then we went further and showed that conventional finite parameter models 
are not the only possible scenario. We investigated nonparametric and (under- 
sampled) discrete learning to show that the capacity control mechanisms work 
much better than one might have thought. We showed that the 'information theory
only' approach to learning can be made self-consistent and does not need any
supplemental help to survive. All of these efforts constitute an attempt to build
up some new flesh around the core of ideas that are the focal point of our attention.
This flesh may seem too thin, and it certainly is: a lot more has to be done
before one can finally say: "The End!" However, one has to start somewhere, and 
we did.

Appendix



A.1 Summary of nonparametric learning

The main problem of statistics — inferring distributions from a finite data set — 
has a wide variety of possible practical applications. Usually, based on some a 
priori considerations, an observer has some finite-dimensional model for the dis- 
tribution being studied. Then the problem reduces to an estimation of a finite 
number of parameters from a large data set, and this is relatively well-studied
(see, for example, Vapnik 1998, Balasubramanian 1997, and Sections 2.4.1-2.4.5
of the current work). Unfortunately, reducing the problem to a finite number of
parameters heavily biases the outcome of statistical inference: the true distribution
may not even be in the chosen family. Thus lately it has become popular to
look for nonparametric solutions to the problem of learning distributions (recent
reviews are Dey et al. 1998 and Lemm 1999). As discussed in Section 2.4.5,
nonparametric estimations necessarily are prior dependent, i.e., Bayesian. Therefore,
the result of the inference is a probability distribution of probability distributions,
which becomes more concentrated as the number of samples increases. Even
though the result depends on the prior, the prior may be very mildly restrictive
(say only some smoothness constraints are assumed), and then the bias is less
than in any finite parameter setting.

Recently Bialek, Callan, and Strong presented an elegant formulation that 
casts nonparametric Bayesian learning in the familiar setting of statistical me- 
chanics or, equivalently, Euclidean Quantum Field Theory (QFT) (Bialek, Callan, 
and Strong 1996). This approach and some alternative formulations were further 
developed by Periwal (1997, 1998), Holy (1997), and Aida (1998). In the present 
work, we have utilized heavily the techniques and results of Bialek et al. and expanded
or corrected some of their conclusions. To make our presentation more
self-contained, we present here a brief overview of the theory augmented with
some comments of our own.

Following the original reference, if $N$ i.i.d. samples $\{x_i\}$, $i = 1 \ldots N$, are
observed, then the probability that a particular density $Q(x)$ gave rise to these data
is given by the application of the Bayes formula

$$P[Q(x)|\{x_i\}] = \frac{P[\{x_i\}|Q(x)]\,\mathcal{P}[Q(x)]}{P(\{x_i\})} = \frac{\mathcal{P}[Q(x)] \prod_{i=1}^{N} Q(x_i)}{\int [dQ(x)]\,\mathcal{P}[Q(x)] \prod_{i=1}^{N} Q(x_i)}, \tag{A.1}$$

where $\mathcal{P}[Q(x)]$ encodes our a priori expectations of $Q$. Specifying this prior on a
space of functions defines a QFT, and the optimal least squares estimator is then
given by a ratio of correlation functions

$$Q_{\rm est}(x|\{x_i\}) = \int [dQ(x)]\, Q(x)\, P[Q(x)|\{x_i\}] \tag{A.2}$$

$$= \frac{\langle Q(x) Q(x_1) Q(x_2) \cdots Q(x_N) \rangle^{(0)}}{\langle Q(x_1) Q(x_2) \cdots Q(x_N) \rangle^{(0)}}, \tag{A.3}$$

where $\langle \cdots \rangle^{(0)}$ means averaging with respect to the prior. Since $Q(x) > 0$, it is
convenient to define an unconstrained field $\phi(x)$,

$$Q(x) = \frac{1}{\ell_0} e^{-\phi(x)}. \tag{A.4}$$

Unlike another possible choice, $Q(x) = [\phi(x)]^2$ (Holy 1997), this definition puts $Q$
and $\phi$ in one-to-one correspondence.
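
As a rough illustration of Eqs. (A.1)-(A.3), the functional integral over densities can be replaced by a sum over a small, hand-picked family of candidate densities. The Python sketch below does exactly that. The two candidate densities on [0, 1], the flat prior weights, and the sample values are all hypothetical choices made for illustration; none of them come from the text, and a finite family is of course far cruder than the field-theoretic prior discussed next.

```python
import math

def posterior_weights(candidates, prior, samples):
    """Finite-family analog of Eq. (A.1): posterior weight of each candidate density."""
    # Work with logarithms so that long products of Q(x_i) do not underflow.
    log_w = [math.log(p) + sum(math.log(q(x)) for x in samples)
             for q, p in zip(candidates, prior)]
    shift = max(log_w)
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]

def q_est(x, candidates, prior, samples):
    """Finite-family analog of Eqs. (A.2)-(A.3): posterior mean of Q(x)."""
    weights = posterior_weights(candidates, prior, samples)
    return sum(w * q(x) for w, q in zip(weights, candidates))

# Hypothetical family on [0, 1]: a uniform density and a linearly rising one.
candidates = [lambda x: 1.0, lambda x: 2.0 * x]
prior = [0.5, 0.5]
samples = [0.9, 0.8, 0.95]

weights = posterior_weights(candidates, prior, samples)
estimate_at_one = q_est(1.0, candidates, prior, samples)
```

With samples clustered near the upper end of the interval, the posterior weight shifts toward the rising density, and the estimate at any point is the correspondingly weighted mixture of the candidates.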

The next step is to select a prior that regularizes the infinite number of degrees
of freedom and allows learning. We require that when we estimate the distribution
$Q(x)$ the answer must be everywhere finite. Also we want the prior $\mathcal{P}[\phi]$ to
make sense as a continuous theory, so that the statistics of $\phi(x)$ on large scales
are not affected, for example, by discretization or round-off errors in $x$ on small
scales. This implies that we should look for a renormalization group invariant
prior (the first steps along this direction were performed in Aida 1998)^1. Simpler,
but almost equally satisfying, is any ultraviolet (UV) convergent prior. For $x$ in
one dimension, a minimal choice that is the easiest for theoretical (and, incidentally,
numerical) analysis is

$$\mathcal{P}[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2} \int dx\, (\partial_x \phi)^2 \right] \delta\left[ \frac{1}{\ell_0} \int dx\, e^{-\phi(x)} - 1 \right], \tag{A.5}$$

where $Z$ is the normalization constant, and the $\delta$-function enforces normalization
of $Q$. The coefficient $\ell$ defines a scale below which variations in $\phi$ are considered
to be too rapid, thus we refer to $\ell$ as the smoothness scale. By making this scale local,
one may also achieve reparameterization invariance of learned results (Periwal
1997, 1998).
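
To make the role of the smoothness scale more concrete, the following Python sketch evaluates a discretized version of the two ingredients of the prior in Eq. (A.5): the kinetic term $(\ell/2)\int dx\,(\partial_x\phi)^2$ and the normalization integral $(1/\ell_0)\int dx\,e^{-\phi}$. The grid, the finite-difference scheme, the value of $\ell$, and the two test fields are assumptions introduced here for illustration only.

```python
import math

def kinetic_action(phi, dx, ell):
    """Approximate (ell/2) * integral dx (d phi/dx)^2 using forward differences."""
    return 0.5 * ell * sum((phi[k + 1] - phi[k]) ** 2 for k in range(len(phi) - 1)) / dx

def normalization(phi, dx, ell0):
    """Approximate (1/ell0) * integral dx exp(-phi) using the trapezoid rule."""
    w = [math.exp(-p) for p in phi]
    return dx * (sum(w) - 0.5 * (w[0] + w[-1])) / ell0

# Hypothetical uniform grid on [0, 1].
n_grid = 1001
dx = 1.0 / (n_grid - 1)
xs = [k * dx for k in range(n_grid)]

flat = [0.0 for _ in xs]                                   # uniform Q(x) = 1/ell0
wiggly = [0.1 * math.sin(40.0 * math.pi * x) for x in xs]  # rapid small oscillations

ell = 0.05
s_flat = kinetic_action(flat, dx, ell)      # no gradients, so no penalty
s_wiggly = kinetic_action(wiggly, dx, ell)  # suppressed in the prior by exp(-s_wiggly)
```

The flat field carries no kinetic cost, while the oscillating field, whose variations occur on scales much shorter than those the prior favors, is exponentially suppressed. The normalization helper shows how the constraint inside the $\delta$-function could be checked on the same grid.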

The resulting field theory was solved by Bialek et al. up to one-loop corrections
in the limit of large $N$ using standard semiclassical techniques:

$$\langle Q(x_1) \cdots Q(x_N) \rangle^{(0)} \approx \frac{1}{\ell_0^N} \exp\left( -H_{\rm eff}[\phi_{\rm cl}(x); \{x_i\}; \ell] - \sum_{j=1}^{N} \phi_{\rm cl}(x_j) \right), \tag{A.6}$$



$$H_{\rm eff} = \frac{\ell}{2} \int dx\, (\partial_x \phi_{\rm cl})^2 + \frac{1}{2} \left( \frac{N}{\ell \ell_0} \right)^{1/2} \int dx\, \exp[-\phi_{\rm cl}(x)/2], \tag{A.7}$$

$$\ell\, \partial_x^2 \phi_{\rm cl}(x) + \frac{N}{\ell_0} \exp[-\phi_{\rm cl}(x)] = \sum_{j=1}^{N} \delta(x - x_j), \tag{A.8}$$

where $\phi_{\rm cl}$ is the 'classical' solution to the field theory. In the effective Hamiltonian
[Eq. (A.7)], the first term is due to the value of the prior at $\phi_{\rm cl}$, while the second
one is the infinite dimensional determinant arising from one-loop integration
over fluctuations around the classical solution. Calculating this determinant is
the most technically involved step in the solution, and this can be done using
a standard van Vleck technique (see, for example, Chapter 7 of Coleman 1988).
This term is a direct analog of Occam factors that appear in finitely parameterizable
models (MacKay 1992, Balasubramanian 1997) and allow one to build a
complexity penalizing razor.

^1 As noted by V. Periwal in private communication, one may hope to construct a complete
theory of nonparametric learning in many dimensions by choosing a renormalization group
compliant prior. That is, the prior's (many) parameters have to be defined to change with the
renormalization group flow in such a way that the resulting correlation functions do not depend
on the cutoff scale, which in turn is due to round-off, discretization, or filtering.
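
The classical equation, Eq. (A.8), can be recovered by a short heuristic variation. The sketch below is not the route taken in the original reference: it replaces the $\delta$-function of Eq. (A.5) by a Lagrange multiplier $\lambda$ (a symbol introduced here), uses $\prod_j Q(x_j) = \ell_0^{-N} \exp[-\sum_j \phi(x_j)]$ from Eq. (A.4), and assumes that $\partial_x \phi$ vanishes at the boundaries so that the integrated second derivative drops out.

```latex
% Exponent to be extremized: kinetic term, data term, normalization constraint
S[\phi] = \frac{\ell}{2}\int dx\,(\partial_x\phi)^2 + \sum_{j=1}^{N}\phi(x_j)
        + \lambda\left[\frac{1}{\ell_0}\int dx\, e^{-\phi(x)} - 1\right]

% Stationarity with respect to \phi(x)
\frac{\delta S}{\delta\phi(x)} = -\ell\,\partial_x^2\phi(x) + \sum_{j=1}^{N}\delta(x-x_j)
        - \frac{\lambda}{\ell_0}\,e^{-\phi(x)} = 0

% Integrating over x and using the normalization constraint fixes the multiplier
N = \frac{\lambda}{\ell_0}\int dx\, e^{-\phi(x)} = \lambda
\quad\Longrightarrow\quad
\ell\,\partial_x^2\phi_{\rm cl}(x) + \frac{N}{\ell_0}\,e^{-\phi_{\rm cl}(x)}
        = \sum_{j=1}^{N}\delta(x-x_j)
```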

Using the WKB method, the authors have shown that the solutions [both the
classical approximation $Q_{\rm cl} = (1/\ell_0) \exp(-\phi_{\rm cl})$, and the optimal least squares
estimator $Q_{\rm est}$, Eq. (A.3)] are non-singular even at finite $N$ and are essentially
self-consistent averagings of fluctuations (samples) over regions of a (local) size

$$\xi \sim [\ell / N Q_{\rm cl}(x)]^{1/2}. \tag{A.9}$$
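
The scaling in Eq. (A.9) is easy to tabulate. The minimal Python sketch below treats the relation as an equality with a unit prefactor, which is an assumption (the text only gives the scaling), and the function names are hypothetical. It also reports the typical number of samples falling inside one smoothing region, $N Q \xi = (\ell N Q)^{1/2}$, which follows from Eq. (A.9) by simple multiplication.

```python
import math

def local_scale(ell, n, q):
    """Local averaging length xi = sqrt(ell / (n * q)), with a unit prefactor assumed."""
    return math.sqrt(ell / (n * q))

def samples_per_region(ell, n, q):
    """Expected number of samples within one averaging length: n * q * xi."""
    return n * q * local_scale(ell, n, q)

# Hypothetical numbers: smoothness scale 1, local density 1, growing sample size.
for n in (100, 400, 1600):
    xi = local_scale(1.0, n, 1.0)
    print(n, xi, samples_per_region(1.0, n, 1.0))
```

Under these assumptions, quadrupling $N$ halves the averaging length while doubling the number of samples per region, so the estimate gains spatial resolution and statistical accuracy at the same time.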

It was assumed implicitly that the target distribution $P(x)$ being learned varies
negligibly over this length scale, and then the WKB method can be used again to
show that the fluctuations in the estimate, $\psi(x) = \phi(x) - [-\log P(x)]$, behave at
large $N$ as

$$\langle \psi(x) \rangle = -\frac{\ell}{N P(x)}\, \partial_x^2 \log P(x) + \cdots, \tag{A.10}$$

$$\langle [\delta\psi(x)]^2 \rangle = \frac{1}{4} \frac{1}{\sqrt{\ell N P(x)}} + \cdots. \tag{A.11}$$

If $P(x)$ is not smooth enough, then in both of these equations one must replace
$P$ by its version smoothed over the local smoothness scale $\xi$ (for more on this
see Section [i.bp. Note also that the variance of the fluctuations is not uniformly
small; this is a direct result of the parameterization dependence of the prior, Eq. (A.5)
(Periwal 1998).

One of the most interesting conjectures of the Bialek et al. (1996) paper is that
the Occam factor (the fluctuation determinant) is enough to construct a complexity
razor just as in the finite parameter case. Indeed, one may impose an a priori
distribution on $\ell$ and average over it after the correlation function, Eq. (A.6), is
found. The kinetic and the fluctuation determinant terms of the effective Hamiltonian,
Eq. (A.7), have opposite dependences on $\ell$, so at large $N$ the average
should be dominated by some $\ell^*$, for which $H_{\rm eff}$ is minimal. The data itself should
select the best smoothness scale, consistent with the finite parameter Minimum
Description Length paradigm of Rissanen (1983, 1989) and Occam's complexity
razor of MacKay (1992) and Balasubramanian (1997). With the same implicit
assumption of a very smooth $P(x)$ the authors have concluded that

$$\ell^* \sim N^{1/3}. \tag{A.12}$$

If $P(x)$ is not smooth enough, a different dependence of $\ell^*$ on $N$ should be expected
(cf. Section ^!5|) .
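
The exponent in Eq. (A.12) can be checked with a toy calculation. The Python sketch below assumes that, for a very smooth target, the effective Hamiltonian reduces to a kinetic term linear in $\ell$ plus a fluctuation term proportional to $(N/\ell)^{1/2}$, with $\ell$-independent coefficients `a` and `b`. Both coefficients and the helper names are hypothetical, and the golden-section search is just one convenient way of locating the minimum.

```python
import math

def h_eff(ell, n, a=1.0, b=1.0):
    """Toy effective Hamiltonian: kinetic term a*ell plus fluctuation term b*sqrt(n/ell)."""
    return a * ell + b * math.sqrt(n / ell)

def ell_star_numeric(n, a=1.0, b=1.0, lo=1e-6, hi=1e6, iters=200):
    """Golden-section search in log(ell); h_eff is convex in log(ell), hence unimodal."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x_lo, x_hi = math.log(lo), math.log(hi)
    for _ in range(iters):
        x1 = x_hi - inv_phi * (x_hi - x_lo)
        x2 = x_lo + inv_phi * (x_hi - x_lo)
        if h_eff(math.exp(x1), n, a, b) < h_eff(math.exp(x2), n, a, b):
            x_hi = x2
        else:
            x_lo = x1
    return math.exp(0.5 * (x_lo + x_hi))

def ell_star_closed_form(n, a=1.0, b=1.0):
    """Stationary point of h_eff: a = (b/2) * sqrt(n) * ell**(-3/2)."""
    return (b / (2.0 * a)) ** (2.0 / 3.0) * n ** (1.0 / 3.0)

# Log-log slope of the optimal smoothness scale between two sample sizes.
slope = math.log(ell_star_numeric(1e6) / ell_star_numeric(1e3)) / math.log(1e3)
```

In this toy setting the slope comes out at one third regardless of the values chosen for `a` and `b`, which only rescale the prefactor of $\ell^*$.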

This approach has a few shortcomings, most of them arising from reparameterization
noninvariance and from the lack of a clear statement of the above-mentioned
smooth target assumption. Some of these problems are discussed in our
present work, and some have been analyzed by Periwal (1997, 1998).

A.2 Correlation function evaluation

In the step from Eq. ( |0| ) to Eq. (^J^ we have performed the integral of the fol- 
lowing form 

$$I = \int_0^1 dt_1\, dt_2 \cdots dt_K \prod_{j=1}^{K} t_j^{z_j}\, \delta\left( 1 - \sum_{j=1}^{K} t_j \right), \tag{A.13}$$

where $t_j$ was $Q(x)$, $z_j$ was $n(x) + \lambda Q^*(x)$, and the upper limit of integration for each
variable is 1 because probabilities are normalized to one. This integral may be
calculated in a straightforward manner by integrating out each variable in turn.
This creates a product of B-functions, which can be simply reduced to the final
result, Eq. (A.17). However, this ease of integration is a consequence of the simplicity
of our toy model. Some other models currently under investigation (see
Section |4^) involve similar integrals, but they are sufficiently different to prohibit 
easy exact calculations. Keeping these possible applications in mind, we want to 
show a trickier method of integrating Eq. (A.13) that may be more useful in other
problems.

First note that due to the $\delta$-function, which enforces the normalization, only
$t_j$'s less than or equal to 1 matter. So we may equally well replace the upper limits
of all integrals in Eq. (A.13) by $+\infty$. Then we can replace the delta function by
its Fourier representation, shift the contour of integration by a small $\epsilon$ to the right
[the contours and the directions of the integrations are shown in Fig. A.1], and
exchange the order of the integrals

$$I = \int_0^{\infty} dt_1\, dt_2 \cdots dt_K \prod_{j=1}^{K} t_j^{z_j} \int_{C_1} \frac{d\mu}{2\pi i}\, e^{\mu \left( 1 - \sum_{j=1}^{K} t_j \right)} \tag{A.14}$$

$$= \lim_{\epsilon \to 0} \int_{C_2} \frac{d\mu}{2\pi i}\, e^{\mu} \prod_{j=1}^{K} \int_0^{\infty} dt_j\, t_j^{z_j} e^{-\mu t_j}. \tag{A.15}$$

Figure A.1: Integration contours.

Now, since we shifted the contour, ${\rm Re}\,\mu > 0$. Therefore each of the internal
integrals in Eq. (A.15) is a $\Gamma$ function:

$$I = \prod_{j=1}^{K} \Gamma(z_j + 1) \times \lim_{\epsilon \to 0} \int_{C_2} \frac{d\mu}{2\pi i}\, \frac{e^{\mu}}{\prod_{j=1}^{K} \mu^{z_j + 1}}. \tag{A.16}$$
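
For completeness, the single-variable step behind Eq. (A.16) can be written out explicitly. The substitution $s = \mu t$ used below is stated for real positive $\mu$; extending it to complex $\mu$ with ${\rm Re}\,\mu > 0$ relies on analytic continuation, and the condition $z > -1$ is an assumption needed for convergence at $t = 0$.

```latex
\int_0^{\infty} dt\; t^{z}\, e^{-\mu t}
  = \frac{1}{\mu^{z+1}} \int_0^{\infty} ds\; s^{z}\, e^{-s}
  = \frac{\Gamma(z+1)}{\mu^{z+1}},
\qquad {\rm Re}\,\mu > 0,\quad z > -1 .
```

Applying this to each of the $K$ inner integrals produces the product of $\Gamma$ functions and the factor $\prod_j \mu^{-(z_j+1)} = \mu^{-(\sum_j z_j + K)}$ under the remaining contour integral.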



In more complex cases we probably would have stopped here, trading $K$ real
integrals of Eq. (A.13) for one contour integration in the complex plane; this is
why this method may be of use. However, our toy model is very easy, so we
can proceed further and do the $\mu$ integration. Remembering that $\epsilon$ is small, we
can now use Jordan's lemma and bend the integration contour $C_2$ into $C_3$.
Then using a well known formula for the integral representation of the inverse
$\Gamma$-function [Gradshteyn and Ryzhik 1965, Eq. (8.315)] we get

$$I = \int_0^1 \prod_{j=1}^{K} dt_j\, t_j^{z_j}\, \delta\left( 1 - \sum_{j=1}^{K} t_j \right) = \frac{\prod_{j=1}^{K} \Gamma(z_j + 1)}{\Gamma\left( \sum_{j=1}^{K} z_j + K \right)}. \tag{A.17}$$



Note that, in the case of Eq. (fl.6D, the sum of $z_j$'s is just $N$.
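
The closed form in Eq. (A.17) can be checked numerically for small $K$. The Python sketch below is only a sanity check under stated assumptions: it uses the $\delta$-function to eliminate the last variable, handles only $K = 2$ and $K = 3$, and integrates with a composite Simpson rule. The function names and the test exponents are hypothetical.

```python
import math

def simplex_integral_closed_form(z):
    """Right-hand side of Eq. (A.17): prod_j Gamma(z_j + 1) / Gamma(sum_j z_j + K)."""
    k = len(z)
    log_value = sum(math.lgamma(zj + 1.0) for zj in z) - math.lgamma(sum(z) + k)
    return math.exp(log_value)

def simpson(f, a, b, n=200):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    if b == a:
        return 0.0
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def simplex_integral_numeric(z):
    """Eq. (A.13) for K = 2 or 3, with t_K = 1 - (sum of the other t_j) substituted."""
    if len(z) == 2:
        z1, z2 = z
        return simpson(lambda t: t ** z1 * (1.0 - t) ** z2, 0.0, 1.0)
    if len(z) == 3:
        z1, z2, z3 = z

        def inner(t1):
            def integrand(t2):
                # max() guards against tiny negative values from rounding.
                return t1 ** z1 * t2 ** z2 * max(1.0 - t1 - t2, 0.0) ** z3
            return simpson(integrand, 0.0, 1.0 - t1)

        return simpson(inner, 0.0, 1.0)
    raise ValueError("this sketch only handles K = 2 or K = 3")

# Hypothetical exponents for the comparison.
pair = (simplex_integral_numeric([2, 3]), simplex_integral_closed_form([2, 3]))
triple = (simplex_integral_numeric([1, 2, 3]), simplex_integral_closed_form([1, 2, 3]))
```

For $K = 2$ the formula reduces to the familiar B-function, which is why the step-by-step integration mentioned at the start of this section produces a product of B-functions.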
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